sqoop-user mailing list archives

From Joshua Baxter <joshuagbax...@gmail.com>
Subject Re: how to use --as-parquetfile in sqoop import
Date Thu, 28 May 2015 01:15:42 GMT
I've also found that Parquet as an output format doesn't work properly with
hcat import, and it crashes on timestamp and decimal types. This was using
the 1.4.5 and 1.4.6 clients on CDH 5.3.
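For reference, the failing combination looks roughly like this. This is a sketch only; the JDBC URL, credentials, and table names are placeholders, not details from the report above:

```shell
# Hypothetical example of the combination described above: a Parquet
# import routed through HCatalog. All connection details and names
# are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table SALES \
  --hcatalog-database default \
  --hcatalog-table sales_parquet \
  --as-parquetfile
# With Sqoop 1.4.5/1.4.6 on CDH 5.3 this reportedly fails when the
# source table contains TIMESTAMP or DECIMAL (NUMBER(p,s)) columns.
```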

Even when just importing as text files to HDFS folders, ending up with
files of a suitable size (parquet files aren't splittable) that won't
require rewriting to redistribute evenly is a guessing game. You need to
know how much data you expect to pull, and therefore how many mappers to
specify, and you will also run into problems if the split-by column does
not have an even distribution. The exception is the OraOop connector, in
which case splitting by a column is unnecessary and the distribution is
fairly uniform.
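The sizing guesswork described above looks something like this in practice. Database names, mapper counts, and the split column are illustrative placeholders:

```shell
# Sketch of the "guessing game": you choose the mapper count yourself
# and hope the split column is evenly distributed. All names and
# numbers below are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table SALES \
  --target-dir /data/raw/sales \
  --num-mappers 16 \
  --split-by SALE_ID   # skewed values here mean skewed output files

# With the OraOop connector (--direct), no --split-by is needed and
# the work is divided fairly uniformly across mappers:
sqoop import --direct \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table SALES \
  --target-dir /data/raw/sales \
  --num-mappers 16
```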

We found the most hassle-free method for pulling from an Oracle DB was to
do an hcatalog import as sequence file, which correctly mapped the data
types, and then do an insert or create-table-as-select from the imported
table, converting to parquet at that point instead. Impala is convenient
for the second step if available, as it manages parquet output file sizes
without any effort, regardless of the input data or requested output
compression type.
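The two-step workaround above might be sketched as follows. Database, table, and stanza details are assumptions for illustration, not the exact commands used:

```shell
# Step 1 (sketch): HCatalog import as SequenceFile, which maps the
# Oracle types correctly. Names are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table SALES \
  --hcatalog-database staging \
  --hcatalog-table sales_seq \
  --hcatalog-storage-stanza 'STORED AS SEQUENCEFILE'

# Step 2 (sketch): convert to Parquet in Impala, which sizes the
# output files sensibly on its own.
impala-shell -q "
  CREATE TABLE staging.sales_parquet STORED AS PARQUET
  AS SELECT * FROM staging.sales_seq;
"
```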

Jish
On 28 May 2015 01:49, "Brett Medalen" <bmedalen@hotmail.com> wrote:

> Not available until Sqoop 1.4.5 or 1.4.6
>
> On May 27, 2015, at 6:40 PM, Kumar Jayapal <kjayapal17@gmail.com> wrote:
>
> Hi,
>
> Can I use the --as-parquetfile argument while importing data to Hive?
>
> I have checked the site
>
> https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_basic_usage
>
> I don't see this option any place mentioned.
>
> Thanks
> Jay
>
>
