sqoop-user mailing list archives

From Jack Arenas...@ckarenas.com>
Subject Submitting Sqoop jobs in parallel
Date Mon, 02 Mar 2015 20:24:51 GMT
Hi team,

I'm building an ETL tool that needs to pull a number of tables from a database into HDFS,
and I'm currently doing this sequentially with Sqoop. Ingesting 150 tables of various sizes
took about two hours (none of them are very big, since this is a proof of concept), so I
figured it might be faster to submit the Sqoop jobs in parallel from a fixed-size thread
pool (currently trying 8 threads).

Sequentially this works fine, but as soon as I add parallelism, roughly 75% of my Sqoop
jobs fail. They do ingest data, but the data gets stuck in the staging area
(i.e. /user/username) instead of reaching the proper Hive table location
(i.e. /user/username/Hive/Lab).

Has anyone experienced this before? I figure I could start a separate process that moves
the data from the staging area into the Hive table area, but I'm not sure whether simply
moving the files would be enough or whether there is more involved.
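
For reference, here is a minimal sketch of the kind of parallel submission described
above. It is not the actual tool: the JDBC URL, the staging root, and the function names
are hypothetical placeholders, and giving each job its own --target-dir is only an
assumption about how to keep concurrent jobs from sharing a staging path, not a confirmed
fix for the failures.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholders, not taken from the original setup.
JDBC_URL = "jdbc:mysql://dbhost/sourcedb"
STAGING_ROOT = "/user/username/staging"


def build_command(table):
    """Build the argument list for one `sqoop import` of a single table."""
    return [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--table", table,
        "--hive-import",
        "--hive-table", table,
        # Assumption: a separate staging directory per table, so that
        # concurrent jobs never write to the same HDFS path.
        "--target-dir", "{}/{}".format(STAGING_ROOT, table),
    ]


def run_import(table, runner=subprocess.call):
    """Run one import and return (table, exit_code)."""
    return table, runner(build_command(table))


def import_all(tables, workers=8, runner=subprocess.call):
    """Submit one import per table from a fixed-size thread pool.

    Returns a dict mapping each table name to its exit code.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: run_import(t, runner), tables)
        return dict(results)
```

Calling import_all(["customers", "orders"]) would return a dict of exit codes per table,
so any non-zero entries can be retried or inspected afterwards. The runner argument exists
only so the submission logic can be exercised without a cluster.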


Specs: HDP 2.1, Sqoop

