sqoop-dev mailing list archives

From Gwen Shapira <gshap...@cloudera.com>
Subject Re: Configurable NULL in IDF or Connector?
Date Mon, 01 Dec 2014 19:42:07 GMT
(hijacking the thread a bit for a related point)

I have some misgivings around how we manage the IDF now.

We go with a pretty specific CSV in order to avoid extra processing
for the MySQL/Postgres direct connectors.
I think the intent is to allow running LOAD DATA without any processing.
Therefore we need to research and document the specific formats
required by MySQL and Postgres. Both DBs have pretty specific (and
often funky) formatting requirements (e.g. if escaping is not used then
NULL is written as the word NULL, otherwise as \N...)
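
To make the NULL point concrete, here is a rough sketch (not Sqoop
code; encode_row and its parameters are hypothetical names, and the
single-quote/backslash conventions are assumptions for illustration
only) of how the same record looks under the two NULL spellings
mentioned above:

```python
from typing import Optional, Sequence


def encode_row(
    fields: Sequence[Optional[str]],
    null_token: str = "NULL",  # assumed default; the other spelling is "\\N"
    sep: str = ",",
    quote: str = "'",
) -> str:
    """Render one record as a delimited line.

    None is written as the bare, unquoted null_token. Every real string
    is quoted, so the string 'NULL' stays distinguishable from a null.
    """
    out = []
    for value in fields:
        if value is None:
            out.append(null_token)
        else:
            # Escape backslashes first, then the quote character itself.
            escaped = value.replace("\\", "\\\\").replace(quote, "\\" + quote)
            out.append(quote + escaped + quote)
    return sep.join(out)


row = ["1", None, "O'Neil"]
print(encode_row(row))                    # null spelled as the word NULL
print(encode_row(row, null_token="\\N"))  # null spelled as \N
```

If the token the IDF emits differs from what the target database's
bulk loader expects, the connector has to rewrite every line, which
is exactly the processing the format was meant to avoid.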

If zero-processing load is not feasible, I'd reconsider the IDF and
lean toward a more structured format (Avro?). If the connectors need
to parse the CSV and modify it, we are not gaining anything here. Or,
at the very least, we should benchmark to validate that CSV+processing
is still the fastest / least CPU-intensive option.

Gwen


On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
> Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> re-define it...
>
> Also, for #2... There are a few ways of generating output. It seems NULL
> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
>
> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <jarcec@apache.org>
> wrote:
>
>> I do share the same point of view as Gwen. The CSV format for IDF is very
>> strict so that we have minimal surface area for inconsistencies between
>> multiple connectors. This is because the IDF is an agreed upon exchange
>> format when transferring data from one connector to the other. That however
>> shouldn't stop one connector (such as HDFS) from offering ways to save the
>> resulting CSV differently.
>>
>> We had similar discussion about separator and quote characters in
>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>>
>> Jarcec
>>
>> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
>> >
>> > I think it's two different things:
>> >
>> > 1. HDFS connector should give more control over the formatting of the
>> > data in text files (nulls, escaping, etc)
>> > 2. IDF should give NULLs in a format that is optimized for
>> > MySQL/Postgres direct connectors (since that's one of the IDF design
>> > goals).
>> >
>> > Gwen
>> >
>> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
>> >> Hey guys,
>> >>
>> >> Any thoughts on where configurable NULL values should be? Either the
>> >> IDF or HDFS connector?
>> >>
>> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >>
>> >> -Abe
>>
>>
