sqoop-user mailing list archives

From Joshua Baxter <joshuagbax...@gmail.com>
Subject --as-parquet-file, Oraoop and Decimal and Timestamp types
Date Tue, 02 Dec 2014 18:17:01 GMT
I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down
partitions of a large fact table and getting some great speed. There are
no columns I can split evenly on with the default connector, but with
Oraoop I get evenly sized Parquet files that I can use directly in
Impala and Hive without incurring remote reads. I have noticed a couple
of things, though.

   - Decimal fields are getting imported as strings. SQOOP-1445 refers to
   this but it sounds like a fix isn't planned due to the HCatalog support.
   Unfortunately the direct connectors, apart from Netezza, are not
   currently supported with HCatalog.
   - You need to use the option -Doraoop.timestamp.string=false, otherwise
   you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception due
   to the intermediary file format. However, the column in the resulting
   Parquet file is a double rather than a Hive- or Impala-compatible
   timestamp.
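
To illustrate the second point, here is a minimal Python sketch of turning such a numeric value back into a timestamp downstream. It assumes the number is milliseconds since the Unix epoch in UTC, which I have not confirmed for Oraoop's output; the function name epoch_millis_to_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(millis: float) -> datetime:
    """Convert a numeric epoch value to a UTC datetime.

    ASSUMPTION: `millis` is milliseconds since 1970-01-01 00:00:00 UTC.
    Oracle DATE columns carry no time zone, so treating the value as UTC
    is also an assumption.
    """
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    # Build a sample value from the date in the exception message above.
    sample = (datetime(2014, 7, 24, tzinfo=timezone.utc) - EPOCH).total_seconds() * 1000
    print(epoch_millis_to_utc(sample).isoformat())
```

This only shows the arithmetic; it still means redefining the column in Hive or Impala, which is what I would like to avoid.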

Here is what I am running now.

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false \
-Doraoop.import.partitions=${PARTITION} \
--direct \
--connect jdbc:oracle:thin:@//${DATABASE}  \
--table "${TABLE}" \
--columns COL1,COL2,COL3,COL4,COL5,COL6 \
--map-column-java  COL1=Long,COL2=Long,COL3=Long,COL4=Long \
--m 48 \
--target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \

COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a
long, so I've remapped those to save space. COL5 is a Decimal and COL6 is
a DATE. Is there any way I can remap these as well, so that they are
written into the Parquet file as DECIMAL and timestamp-compatible types
respectively and there is no need to redefine these columns?
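
For context on the Long remapping: NUMBER(38,0) can hold up to 38 decimal digits, while a signed 64-bit long tops out at 9223372036854775807 (19 digits), so the remap is only safe while the data stays in that range. A minimal Python sketch of the range check, with the hypothetical helper name fits_in_long:

```python
# Bounds of a signed 64-bit integer (Java long).
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1


def fits_in_long(value: int) -> bool:
    """Return True if `value` can be stored in a signed 64-bit long."""
    return LONG_MIN <= value <= LONG_MAX


if __name__ == "__main__":
    largest_number_38 = 10 ** 38 - 1  # largest value NUMBER(38,0) can hold
    print(fits_in_long(LONG_MAX))           # within range
    print(fits_in_long(largest_number_38))  # out of range for a long
```

In practice the same comparison could be run against the column's MIN and MAX before importing.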

Many Thanks

