Ingest data using Sqoop: The Normal vs The Smart

In the world of Spark, IOT data & other streaming data sources, I know its a little late to talk about ingesting data using sqoop from an RDBMS data source. But after having worked with both spark & sqoop, and if you are still in Hadoop rather than a cloud platform specific connector, I believe sqoop still got a role to play.

We will now talk a little bit about The Normal way of using Sqoop & The Smart way and see some performance metrics around it.

The Normal: We specify the connection parameters and number of mappers and specify a location for landing data. Data get extracted and stored there in text format(with a choice of delimiter to separate columns). Now we know that text format is one of the least efficient in terms of space utilization & IO performance. Still a was majority of sqoop ingestions are using default data format, which is text.

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true -Dmapreduce.job.user.classpath.first=true  --connect "<service>" --username <user name>--password ************ --query "SELECT col1,col2,col3 FROM db1.tab1 WHERE \$CONDITIONS " --split-by col1 --num-mappers 8 --fields-terminated-by '?' --target-dir /user/data/db1/tab1/op_date_ts202005040946 --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims

here its ingesting data from oracle to following hdfs path /user/data/db1/tab1/op_date_ts202005040946. Data gets saved in 8 files.

The Smart: We save sqoop data in parquet format. Parquet being columns oriented, is extremely efficient the savings structured data. We make use of — as-parquetfile parameter to save the sqooped files as parquet.

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true -Dmapreduce.job.user.classpath.first=true -Doraoop.timestamp.string=false --connect "<service>" --username <user name>--password ************ --query "SELECT col1,col2,col3 FROM db1.tab1 WHERE \$CONDITIONS " --split-by col1 --num-mappers 8 --fields-terminated-by '?' --target-dir /user/data/db1/tab1/op_date_ts202005040946 --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims --as-parquetfile

Here Doraoop.timestamp.string=false will avoid converting oracles date and timestamp data types to string, but rather a number. The data gets saved at a compressed rate by the order of 6 times. Now many organisations ingestion strategies keep the ingested raw data for future reference/machine learning purposes. Ingesting and archiving in parquet will help save on space & future IO operations significantly.

Stats for nerds: Comparing the performance between savings as text and sqoop.

Here total space utilization for parquet is only 15% when compare to text. There is slight deterioration is the overall time taken which gets easily compensated in the time savings we will have during subsequent reads. From my experience, subsequent reads are faster by an order of 2 times if reading all columns and faster by 10 to 100 times if reading only few columns.

So what’s the Catch? Why not everyone by default using Parquet.

  1. Few of the format handling and options available while in sqooping using text is not available while using as-parquet. E.g. Partition folder structure for target path, timestamp and decimal handling
  2. Not much happening in Sqoop world, so people built/initially setup frameworks and standards in organizations moved over to spark or cloud technologies.

Tips for getting started with sqoop with parquet

  1. If you are using older version and folder(partitions) are not supported during sqoop import, move(mv) the data to desired folder structure as a post process.
  2. Date and timestamp from oracle(may be from other RDBMS as well): Specify following parameter “-Doraoop.timestamp.string=false” to save value as int. Then in external table define the field as bigint and convert it to desired format using from_unixtime(bigint(createts/1000)).
  3. Number and decimal fields by default get saved as string. So in external table define the fields as string. All further operations string can casted to a decimal or get auto casted based on target in most cases.

So should you move to using parquet? We it depends your specific situation. Do you have large amounts of data getting archived in raw zone and is going to be accessed for data science or analytics? Then probably the answer is yes.

A Tech Enthusiast working on Big Data & Analytics.