sqoop-user mailing list archives

From Nicolas Paris <nicolas.pa...@riseup.net>
Subject Re: Export PostgreSQL Direct
Date Mon, 12 Nov 2018 10:27:00 GMT
On Sat, Nov 10, 2018 at 05:10:01PM +0100, Nicolas Paris wrote:
> I have been able to load data in **csv only** format with sqoop export
> in **direct mode**. While very fast (parallel copy statements!), the
> method is not robust when the data contain varchar columns. In
> particular, a varchar column may contain **newlines**, and this breaks
> the mapper job, since it splits the csv by newlines.
> That's too bad, because the *copy* statement can handle *newlined csv*.
> 
> 1) Is there any way to send a whole hdfs file per mapper instead of
> splitting it? That would work well.

Finally I found a workaround. Setting the number of mappers to one (-m 1)
means the data is never split, and COPY handles the multi-line CSV
produced by Hive without trouble.

sqoop export \
  --connect "jdbc:postgresql://<postgres_host>/<postgres_db>" \
  --username <postgres_user> \
  --password-file file:///home/$USER/.password \
  --export-dir /apps/hive/warehouse/<my_db>/<my_table_path> \
  --table <my_hive_table> \
  --columns "id, text_column" \
  -m 1 \
  --direct \
  --lines-terminated-by '\n' \
  --fields-terminated-by ',' \
  --input-null-string "\\\\N" \
  --input-null-non-string "\\\\N" \
  -- --schema <my_schema>

This should work in any situation (excluding binary columns). And I am
not sure that one COPY instead of parallel COPYs is worse: it is less
aggressive on the database while performance remains great.


regards,

-- 
nicolas
