sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Szabolcs Vasas <va...@apache.org>
Subject Re: [EXTERNAL] Re: Postgres direct
Date Thu, 31 Jan 2019 10:19:27 GMT
Hi Robert,

Sorry for the late reply, I have been quite busy recently, I will try to
post at least an initial review by the end of tomorrow.
Don't worry about the Oracle tests this is again something we will need to
address separately.

Szabolcs

On Fri, Jan 25, 2019 at 11:17 PM Robert B Hamilton <robert.hamilton@gm.com>
wrote:

> Hi. Szabolcs and many thanks!
>
> I put in a pull request but the CI is failing on some oracle checks that
> are not related to any of these changes.  So I'm not sure where to go from
> there.
>
> There are several changes we put into this one. The first two are the most
> important since without them it would be pretty much a showstopper for our
> shop:
>
> 1. I added optional support for version 8.x of postgress, for which the
> COPY command has a different syntax than later versions.
>
> 2. I also noticed some performance challenges, which I think are related
> to mapreduce mapper feeding lines to the CopyIn object one by one.This
> could introduce some considerable delays if the database is some network
> distant from the Hadoop cluster.     So I added a rudimentary line
> buffering and the export was about 10 to 20 times faster.  The option is
> enabled by using  --batch in combination with --direct.  The only downside
> seems to be that if a data issue causes failure we cannot identify the
> exact row that has the problem, only that it was in that particular batch.
>
> 3. We use '\x1c' for our standard delimiter in text mode. But
> unfortunately that is not a valid XML char so is difficult to pass to the
> map tasks via standard Hadoop config.  This introduces a limitation to
> direct mode that is not present in non direct exports.  I made a minor
> change to base64 encode this before putting into config and base64 decode
> when initializing the map. So with this change we can do -
> -fields-terminated-by '\0x1c' and it just works.
>
> 4. Likewise, using empty string to represent null is supported by
> non-direct but not by direct mode. I made a modification so that
> --null-string ''  also works now with direct mode
>
> 5. We have some bizarre data that contains strings with escape chars,
> things like '\x96'.  Its in a verbatim field and we are required to load it
> into the database table as is. (yes I know the security implications 😊)
>    COPY normally interprets these escape sequences and converts them to
> chars before inserting into tables. Big problem if the escape sequence
> evaluates to a non-utf char. So I added an option to disable this escape
> interpretation.
>
> 6. I added a switch -Dpostegresql.format.text=true, which allows us to
> switch to TEXT mode copy instead of CSV mode.  This is convenient when we
> have data coming in from many sources that don't know how to properly
> format a CSV file.
>
> PS please forgive the verbiage in the Confidentiality note; I don't have
> any way to remove it when mailing from a corporate account....
> --
>
> From: Szabolcs Vasas [mailto:vasas@apache.org]
> Sent: Friday, January 18, 2019 7:05 AM
> To: user@sqoop.apache.org
> Subject: [EXTERNAL] Re: Postgres direct
>
> Hi Robert,
>
> Sounds great! We can discuss it in this chain or you could create a pull
> request and start the discussion there.
>
> On Wed, Jan 16, 2019 at 12:05 AM Attila Szabó <mailto:maugli@apache.org>
> wrote:
> Hi Robert,
>
> I would be happy to assist you with this change.
>
> Cheers,
> Attila
> On Tue, Jan 15, 2019, 11:55 PM Robert B Hamilton <mailto:
> robert.hamilton@gm.com wrote:
> Hi.  Is there much active work being done on the Postgres direct export?
>
> I have found that there is very significant performance improvement if we
> buffer lines into the CopyIn in the map method. We've incorporated this
> in-house with some pretty good results.
>
> Who would I contact if I wanted to share this code? I know about the dev
> mail list but I would like to talk it over first before jumping in with a
> jira...
>
> There are also a few other features that we've added like an option for
> CSV vs TEXT format,  and some miscellaneous options parsing.
>
>
> Nothing in this message is intended to constitute an electronic signature
> unless a specific statement to the contrary is included in this message.
>
> Confidentiality Note: This message is intended only for the person or
> entity to which it is addressed. It may contain confidential and/or
> privileged material. Any review, transmission, dissemination or other use,
> or taking of any action in reliance upon this message by persons or
> entities other than the intended recipient is prohibited and may be
> unlawful. If you received this message in error, please contact the sender
> and delete it from your computer.
>
>
> Nothing in this message is intended to constitute an electronic signature
> unless a specific statement to the contrary is included in this message.
>
> Confidentiality Note: This message is intended only for the person or
> entity to which it is addressed. It may contain confidential and/or
> privileged material. Any review, transmission, dissemination or other use,
> or taking of any action in reliance upon this message by persons or
> entities other than the intended recipient is prohibited and may be
> unlawful. If you received this message in error, please contact the sender
> and delete it from your computer.
>

Mime
View raw message