sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ravi, Chandramouli" <Chandramouli.R...@vantiv.com>
Subject RE: sqoop splitting
Date Wed, 22 Feb 2017 14:59:45 GMT
1) Split key should be Numeric for the split to be even
2) --boundary-query <statement> may not be helpful too in your case as the distribution
is not even.

Is there a possibility to select any other Column as Split Key?

Thanks,
Chandra

From: Brotanek, Jan [mailto:Jan.Brotanek@adastragrp.com]
Sent: Wednesday, February 22, 2017 9:51 AM
To: user@sqoop.apache.org
Subject: sqoop splitting

Hello, I am transferring incremental data from Oracle table using sqoop 1.4.2. direct using
following command:

sqoop import --libjars /usr/hdp/2.4.0.0-169/sqoop/lib/ojdbc6.jar --direct --connect jdbc:oracle:thin:@//xxx/xxx
--username xxx --password xxx --table A.A1 --as-textfile --optionally-enclosed-by '"' --check-column
DAY  --incremental append --last-value "DATE'2016-09-30'" --split-by DAY --warehouse-dir /tmp/benchmarks/
-m4 -z;



17/01/21 14:36:20 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(DAY), MAX(DAY)
FROM A.A1 WHERE ( DAY > DATE'2016-09-30' AND DAY <= TO_TIMESTAMP('2098-01-21 13:09:16.0',
'YYYY-MM-DD HH24:MI:SS.FF') )

In Oracle database it looks like this:
... WHERE ( DAY > DATE'2016-09-30' AND DAY <= TO_TIMESTAMP('2098-01-21 13:09:16.0',
'YYYY-MM-DD HH24:MI:SS.FF') ) AND ( DAY >= TO_TIMESTAMP('2016-09-30 00:00:01.0', 'YYYY-MM-DD
HH24:MI:SS.FF') ) AND ( DAY < TO_TIMESTAMP('2037-01-27 14:32:19.75', 'YYYY-MM-DD HH24:MI:SS.FF')
)

As far as I understand, sqoop determine max and min value, then it divides interval by number
of mappers, then starts to select from 1st interval ... 4th interval and retrieves data to
HDFS

A1 Table has some dummy year 2098 records which are used by other systems and I cannot remove
them from data.
This will split table to 4 mappers where 3 mappers have few data and 4th one is overloaded,
resulting in having 4files from which only one contains most of table's data, which causes
problems in transforming data to ORC tables.

Desired output would be having 4files split evenly. Is there any approach I can achieve it?

Thanks
Jan


Mime
View raw message