sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert B Hamilton <robert.hamil...@gm.com>
Subject RE: [EXTERNAL] Re: Postgres direct
Date Fri, 25 Jan 2019 22:17:15 GMT
Hi. Szabolcs and many thanks!

I put in a pull request but the CI is failing on some oracle checks that are not related to
any of these changes.  So I'm not sure where to go from there.

There are several changes we put into this one. The first two are the most important since
without them it would be pretty much a showstopper for our shop:

1. I added optional support for version 8.x of postgress, for which the COPY command has a
different syntax than later versions.

2. I also noticed some performance challenges, which I think are related to mapreduce mapper
feeding lines to the CopyIn object one by one.This could introduce some considerable delays
if the database is some network distant from the Hadoop cluster.     So I added a rudimentary
line buffering and the export was about 10 to 20 times faster.  The option is enabled by using
 --batch in combination with --direct.  The only downside seems to be that if a data issue
causes failure we cannot identify the exact row that has the problem, only that it was in
that particular batch.

3. We use '\x1c' for our standard delimiter in text mode. But unfortunately that is not a
valid XML char so is difficult to pass to the map tasks via standard Hadoop config.  This
introduces a limitation to direct mode that is not present in non direct exports.  I made
a minor change to base64 encode this before putting into config and base64 decode when initializing
the map. So with this change we can do - -fields-terminated-by '\0x1c' and it just works.

4. Likewise, using empty string to represent null is supported by non-direct but not by direct
mode. I made a modification so that  --null-string ''  also works now with direct mode

5. We have some bizarre data that contains strings with escape chars, things like '\x96'.
 Its in a verbatim field and we are required to load it into the database table as is. (yes
I know the security implications 😊)
   COPY normally interprets these escape sequences and converts them to chars before inserting
into tables. Big problem if the escape sequence evaluates to a non-utf char. So I added an
option to disable this escape interpretation.

6. I added a switch -Dpostegresql.format.text=true, which allows us to switch to TEXT mode
copy instead of CSV mode.  This is convenient when we have data coming in from many sources
that don't know how to properly format a CSV file.

PS please forgive the verbiage in the Confidentiality note; I don't have any way to remove
it when mailing from a corporate account....
--

From: Szabolcs Vasas [mailto:vasas@apache.org]
Sent: Friday, January 18, 2019 7:05 AM
To: user@sqoop.apache.org
Subject: [EXTERNAL] Re: Postgres direct

Hi Robert,

Sounds great! We can discuss it in this chain or you could create a pull request and start
the discussion there.

On Wed, Jan 16, 2019 at 12:05 AM Attila Szabó <mailto:maugli@apache.org> wrote:
Hi Robert,

I would be happy to assist you with this change.

Cheers,
Attila
On Tue, Jan 15, 2019, 11:55 PM Robert B Hamilton <mailto:robert.hamilton@gm.com wrote:
Hi.  Is there much active work being done on the Postgres direct export?

I have found that there is very significant performance improvement if we buffer lines into
the CopyIn in the map method. We've incorporated this in-house with some pretty good results.

Who would I contact if I wanted to share this code? I know about the dev mail list but I would
like to talk it over first before jumping in with a jira...

There are also a few other features that we've added like an option for CSV vs TEXT format,
 and some miscellaneous options parsing.


Nothing in this message is intended to constitute an electronic signature unless a specific
statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to which it is
addressed. It may contain confidential and/or privileged material. Any review, transmission,
dissemination or other use, or taking of any action in reliance upon this message by persons
or entities other than the intended recipient is prohibited and may be unlawful. If you received
this message in error, please contact the sender and delete it from your computer.


Nothing in this message is intended to constitute an electronic signature unless a specific
statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to which it is
addressed. It may contain confidential and/or privileged material. Any review, transmission,
dissemination or other use, or taking of any action in reliance upon this message by persons
or entities other than the intended recipient is prohibited and may be unlawful. If you received
this message in error, please contact the sender and delete it from your computer.
Mime
View raw message