sqoop-dev mailing list archives

From "Markus Kemper (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SQOOP-3130) Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
Date Wed, 01 Feb 2017 21:56:51 GMT
Markus Kemper created SQOOP-3130:
------------------------------------

             Summary: Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
                 Key: SQOOP-3130
                 URL: https://issues.apache.org/jira/browse/SQOOP-3130
             Project: Sqoop
          Issue Type: Bug
            Reporter: Markus Kemper


When using Sqoop export with --export-dir against Avro data files, Sqoop does not obey the requested --num-mappers value; instead it creates one map task per Avro data file.

One known workaround for this issue is to use the Sqoop --hcatalog options in place of --export-dir.

Please see the test case below, which demonstrates both the issue and the workaround.
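
For quick reference, these are the two command shapes from the test case: the --export-dir form that exhibits the problem (STEP 06) and the --hcatalog form that honors --num-mappers (STEP 07). The connection variables are defined in STEP 02.

{noformat}
# Exhibits the issue: one map task is created per .avro file under the
# export directory, and --num-mappers 2 is ignored (see STEP 06)
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table T1_TEXT --export-dir /user/root/external/t1_avro \
  --input-fields-terminated-by ',' --num-mappers 2

# Workaround: export through HCatalog instead of --export-dir; the
# requested 2 map tasks are used (see STEP 07)
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro \
  --num-mappers 2
{noformat}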

*Test Case*
{noformat}
#################
# STEP 01 - Create Data
#################

for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; echo "$i,$d,row data" >> ./data.csv; done
ls -l ./*;
wc data.csv 
hdfs dfs -mkdir -p /user/root/external/t1
hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
hdfs dfs -ls -R /user/root/external/t1/

Output:
-rw-r--r-- 1 root root 3488895 Feb  1 11:20 ./data.csv
~~~~~
100000  300000 3488895 data.csv
~~~~~
-rw-r--r--   3 root root    3488895 2017-02-01 11:26 /user/root/external/t1/data.csv

#################
# STEP 02 - Create RDBMS Table and Export Data
#################

export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
export MYUSER=sqoop
export MYPSWD=cloudera

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1_text (c1 int, c2 date, c3 varchar(10))"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' --num-mappers 1
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 24.8037 seconds (137.3688 KB/sec)
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~~
------------------------
| COUNT(*)             | 
------------------------
| 100000               | 
------------------------

#################
# STEP 03 - Import Data as Text Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir --num-mappers 10 --split-by C1 --as-textfile
hdfs dfs -ls /user/root/external/t1_text/part*

Output:
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 57.0517 seconds (63.1434 KB/sec)
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r--   3 root root     358894 2017-02-01 11:38 /user/root/external/t1_text/part-m-00000
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00001
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00002
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00003
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00004
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00005
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00006
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00007
-rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00008
-rw-r--r--   3 root root     370001 2017-02-01 11:38 /user/root/external/t1_text/part-m-00009
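
# Note: the import produced 10 part files of roughly equal size (one per
# mapper); these are the input for the text export in STEP 04.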

#################
# STEP 04 - Export the 10 Text-Formatted Files Using 2 Splits
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             | 
------------------------
| 0                    | 
------------------------
~~~~~
17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
<SNIP>
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 31.7104 seconds (113.6324 KB/sec)
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             | 
------------------------
| 100000               | 
------------------------
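
# Note: with text input, the 10 input files were combined into the 2 requested
# splits ("Total input paths to process : 10" -> "number of splits:2"), so
# --num-mappers is honored for text-formatted data.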

#################
# STEP 05 - Import Data as Avro Creating 10 HDFS Files
#################

sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir --num-mappers 10 --split-by C1 --as-avrodatafile
hdfs dfs -ls /user/root/external/t1_avro/*.avro

Output:
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 68.454 seconds (35.4568 KB/sec)
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r--   3 root root     231119 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00000.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00001.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00002.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00003.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00004.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00005.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00006.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00007.avro
-rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00008.avro
-rw-r--r--   3 root root     250478 2017-02-01 11:56 /user/root/external/t1_avro/part-m-00009.avro

#################
# STEP 06 - Export the 10 Avro-Formatted Files Using 2 Splits (reproduction of issue)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10 <================== incorrect; should be 2, not 10
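# Note: the split count equals the input file count (10 .avro files -> 10
# splits), which suggests the Avro export path assigns one split per file
# rather than combining inputs to honor --num-mappers (an assumption based on
# the observed counts, not confirmed against the Sqoop source).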
<SNIP>
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 57.1965 seconds (43.8575 KB/sec)
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             | 
------------------------
| 100000               | 
------------------------

#################
# STEP 07 - Export the 10 Avro-Formatted Files Using 2 Splits via the HCatalog Options (workaround)
#################

sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use default; drop table t1_avro; create external table t1_avro (c1 int, c2 string, c3 string) row format delimited fields terminated by ',' stored as avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from t1_avro;"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"

Output:
------------------------
| COUNT(*)             | 
------------------------
| 0                    | 
------------------------
~~~~~
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
~~~~~
17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 10
17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2 <================ correct!
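# Note: via the HCatalog options the same 10 Avro files are combined into the
# 2 requested splits, which is why --hcatalog works as a workaround here.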
<SNIP>
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 48.7286 seconds (53.0082 KB/sec)
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*)             | 
------------------------
| 100000               | 
------------------------
{noformat}
