sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gwen Shapira <gshap...@cloudera.com>
Subject Re: Configurable NULL in IDF or Connector?
Date Mon, 01 Dec 2014 19:55:00 GMT
Agreed. I hope we'll have at least one direct connector real soon now
to prove it.

Reading this:
http://dev.mysql.com/doc/refman/5.6/en/load-data.html
was a bit discouraging...

On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <abe@cloudera.com> wrote:
> My understanding is that MySQL and PostgreSQL can output to CSV in the
> suggested format.
>
> NOTE: getTextData() and setTextData() APIs are effectively useless if
> reduced processing load is not possible.
>
> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
>
>> (hijacking the thread a bit for a related point)
>>
>> I have some misgivings around how we manage the IDF now.
>>
>> We go with a pretty specific CSV in order to avoid extra-processing
>> for MySQL/Postgres direct connectors.
>> I think the intent is to allow running LOAD DATA  without any processing.
>> Therefore we need to research and document the specific formats
>> required by MySQL and Postgres. Both DBs have pretty specific (and
>> often funky) formatting they need (If escaping is not used then NULL
>> is null, otherwise \N...)
>>
>> If zero-processing load is not feasible, I'd re-consider the IDF and
>> lean toward a more structured format (Avro?).  If the connectors need
>> to parse the CSV and modify it, we are not gaining anything here. Or
>> at the very least benchmark to validate that CSV+processing is still
>> the fastest / least CPU option.
>>
>> Gwen
>>
>>
>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <abe@cloudera.com>
>> wrote:
>> > Indeed. I created SQOOP-1678 is intended to address #1. Let me re-define
>> > it...
>> >
>> > Also, for #2... There are a few ways of generating output. It seems NULL
>> > values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
>> sense.
>> >
>> > On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <jarcec@apache.org>
>> > wrote:
>> >
>> >> I do share the same point of view as Gwen. The CSV format for UDF is
>> very
>> >> strict so that we have minimal surface area for inconsistencies between
>> >> multiple connectors. This is because the IDF is an agreed upon exchange
>> >> format when transferring data from one connector to the other. That
>> however
>> >> shouldn't stop one connector (such as HDFS) to offer ways to save the
>> >> resulting CSV differently.
>> >>
>> >> We had similar discussion about separator and quote characters in
>> >> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>> >>
>> >> Jarcec
>> >>
>> >> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <gshapira@cloudera.com>
>> wrote:
>> >> >
>> >> > I think its two different things:
>> >> >
>> >> > 1. HDFS connector should give more control over the formatting of the
>> >> > data in text files (nulls, escaping, etc)
>> >> > 2. IDF should give NULLs in a format that is optimized for
>> >> > MySQL/Postgres direct connectors (since thats one of the IDF design
>> >> > goals).
>> >> >
>> >> > Gwen
>> >> >
>> >> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <abe@cloudera.com>
>> >> wrote:
>> >> >> Hey guys,
>> >> >>
>> >> >> Any thoughts on where configurable NULL values should be? Either
the
>> >> IDF or
>> >> >> HDFS connector?
>> >> >>
>> >> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >> >>
>> >> >> -Abe
>> >>
>> >>
>>

Mime
View raw message