sqoop-dev mailing list archives

From Gwen Shapira <gshap...@cloudera.com>
Subject Re: Configurable NULL in IDF or Connector?
Date Mon, 01 Dec 2014 23:31:34 GMT
Performance numbers would be sweet at some point for sure.
Based on some rough tests we did in the field (on another project),
Avro serialization does have significant overhead (I think Hive
writing CSV was 20% faster than writing Avro; I can dig up my results
later). It may be even worse for Sqoop since Hive does serialization
in batches.

This is not completely scientific, but leads me to believe that as
much as I love Avro, we'll need a good reason to use it internally.
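For what it's worth, the shape of such a benchmark could be a stdlib-only micro-benchmark like the one below. This is a sketch, not Sqoop's code path: json stands in for a structured serializer (real Avro bindings would behave differently), and the row shape and sizes are made up.

```python
import csv
import io
import json
import time

# Synthetic rows: an id, a string, and a float that is NULL every 10th row.
rows = [(i, "name-%d" % i, None if i % 10 == 0 else i * 0.5)
        for i in range(100000)]

def time_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)
    start = time.perf_counter()
    for r in rows:
        # Mimic the IDF convention of encoding SQL NULL as a sentinel string.
        writer.writerow(["\\N" if v is None else v for v in r])
    return time.perf_counter() - start, buf.tell()

def time_json(rows):
    buf = io.StringIO()
    start = time.perf_counter()
    for r in rows:
        # json is only a stand-in for a "structured" serializer here.
        buf.write(json.dumps(r))
        buf.write("\n")
    return time.perf_counter() - start, buf.tell()

csv_secs, csv_chars = time_csv(rows)
json_secs, json_chars = time_json(rows)
print("csv : %.3fs, %d chars" % (csv_secs, csv_chars))
print("json: %.3fs, %d chars" % (json_secs, json_chars))
```

Timings from a toy loop like this would only bound the question; a real comparison would have to run inside Sqoop's actual read/write path.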

On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj <vbasavaraj@cloudera.com> wrote:
> Jarcec,
>
> It would have been great if we were more metrics driven, with some tests
> and/or benchmarks to prove how much faster this would be. Just a suggestion.
>
> Gwen probably meant the same as well.
>
>
> Best,
> *./Vee*
>
> On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <jarcec@apache.org>
> wrote:
>
>> Gwen,
>> we’ve investigated mysqldump, pg_dump and a few others already; the results
>> are on the wiki [1]. The resulting CSV-ish specification follows those
>> two very closely.
>>
>> In the MySQL case specifically, I’ve looked into mysqldump output rather
>> than the “LOAD DATA”/“SELECT INTO OUTFILE” statements because “LOAD DATA”
>> requires the file to exist on the database machine, whereas
>> mysqldump/mysqlimport allows us to import data into the database from any
>> machine on the Hadoop cluster.
>>
>> Jarcec
>>
>> Links:
>> 1:
>> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>>
>> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
>> >
>> > Agreed. I hope we'll have at least one direct connector real soon now
>> > to prove it.
>> >
>> > Reading this:
>> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>> > was a bit discouraging...
>> >
>> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <abe@cloudera.com>
>> wrote:
>> >> My understanding is that MySQL and PostgreSQL can output to CSV in the
>> >> suggested format.
>> >>
>> >> NOTE: getTextData() and setTextData() APIs are effectively useless if
>> >> reduced processing load is not possible.
>> >>
>> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <gshapira@cloudera.com>
>> wrote:
>> >>
>> >>> (hijacking the thread a bit for a related point)
>> >>>
>> >>> I have some misgivings around how we manage the IDF now.
>> >>>
>> >>> We go with a pretty specific CSV in order to avoid extra processing
>> >>> for the MySQL/Postgres direct connectors.
>> >>> I think the intent is to allow running LOAD DATA without any
>> >>> processing.
>> >>> Therefore we need to research and document the specific formats
>> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
>> >>> often funky) formatting they need (if escaping is not used then NULL
>> >>> is a literal null, otherwise \N...)
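The escaping rule above is exactly the kind of per-field shim a connector would need if zero-processing load turns out to be infeasible. A minimal sketch of what that looks like; the function name and the escape set are illustrative, not Sqoop code, and follow the MySQL LOAD DATA convention of \N for NULL only when escaping is enabled:

```python
def mysql_load_data_field(value, escaped=True):
    """Render one field for a LOAD DATA-style text line.

    Illustrative only: with escaping enabled, MySQL expects NULL as \\N;
    with FIELDS ESCAPED BY disabled, a literal NULL word is used instead.
    """
    if value is None:
        return "\\N" if escaped else "NULL"
    text = str(value)
    if escaped:
        # Escape the characters that are special under the default
        # FIELDS ESCAPED BY '\\' handling (backslash first, so the
        # escape character itself is not double-processed).
        for ch in ("\\", "\t", "\n"):
            text = text.replace(ch, "\\" + ch)
    return text

row = [1, "O'Brien\tJr.", None]
print("\t".join(mysql_load_data_field(v) for v in row))
```

If the IDF already emits exactly this form, the direct connector streams it through; if not, every row pays a call like this per field, which is the processing cost being debated.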
>> >>>
>> >>> If zero-processing load is not feasible, I'd re-consider the IDF and
>> >>> lean toward a more structured format (Avro?).  If the connectors need
>> >>> to parse the CSV and modify it, we are not gaining anything here. Or
>> >>> at the very least benchmark to validate that CSV+processing is still
>> >>> the fastest / least CPU option.
>> >>>
>> >>> Gwen
>> >>>
>> >>>
>> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <abe@cloudera.com>
>> >>> wrote:
>> >>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define
>> >>>> it...
>> >>>>
>> >>>> Also, for #2... There are a few ways of generating output. It seems
>> >>>> NULL values range from "\N" to 0x0 to "NULL". I think keeping NULL
>> >>>> makes sense.
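If configurability ends up on the connector side, the knob could be as small as a per-job set of NULL sentinels that get normalized to one canonical spelling. A hypothetical sketch (names and defaults made up, not from SQOOP-1678):

```python
# Hypothetical: normalize the NULL spellings different tools emit
# ("\N", a NUL byte, the literal word "NULL") to one canonical form.
DEFAULT_NULL_SENTINELS = frozenset(["\\N", "\x00", "NULL"])

def normalize_null(field, sentinels=DEFAULT_NULL_SENTINELS, canonical="NULL"):
    """Map any configured NULL spelling to the canonical one."""
    return canonical if field in sentinels else field

fields = ["\\N", "abc", "\x00"]
print([normalize_null(f) for f in fields])
```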
>> >>>>
>> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <jarcec@apache.org>
>> >>>> wrote:
>> >>>>
>> >>>>> I do share the same point of view as Gwen. The CSV format for the
>> >>>>> IDF is very strict so that we have a minimal surface area for
>> >>>>> inconsistencies between multiple connectors. This is because the
>> >>>>> IDF is an agreed-upon exchange format when transferring data from
>> >>>>> one connector to the other. That however shouldn't stop one
>> >>>>> connector (such as HDFS) from offering ways to save the resulting
>> >>>>> CSV differently.
>> >>>>>
>> >>>>> We had a similar discussion about separator and quote characters
>> >>>>> in SQOOP-1522 that seems to be relevant to the NULL discussion
>> >>>>> here.
>> >>>>>
>> >>>>> Jarcec
>> >>>>>
>> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <gshapira@cloudera.com>
>> >>> wrote:
>> >>>>>>
>> >>>>>> I think it's two different things:
>> >>>>>>
>> >>>>>> 1. HDFS connector should give more control over the formatting of
>> >>>>>> the data in text files (nulls, escaping, etc.)
>> >>>>>> 2. IDF should give NULLs in a format that is optimized for the
>> >>>>>> MySQL/Postgres direct connectors (since that's one of the IDF
>> >>>>>> design goals).
>> >>>>>>
>> >>>>>> Gwen
>> >>>>>>
>> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <abe@cloudera.com>
>> >>>>> wrote:
>> >>>>>>> Hey guys,
>> >>>>>>>
>> >>>>>>> Any thoughts on where configurable NULL values should be? Either
>> >>>>>>> the IDF or the HDFS connector?
>> >>>>>>>
>> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >>>>>>>
>> >>>>>>> -Abe
>> >>>>>
>> >>>>>
>> >>>
>>
>>
