sqoop-dev mailing list archives

From Veena Basavaraj <vbasava...@cloudera.com>
Subject Re: Configurable NULL in IDF or Connector?
Date Mon, 01 Dec 2014 23:37:25 GMT
Ah, please do add these details as a comment to the wiki, Gwen. I am glad we
discussed this.

Also, the data set size and structure (nested types or not), the
environment (machines), and various other things matter when we say that
writing CSV is 20% faster than Avro. But it is a tradeoff one should be
able to make, choosing what works best for them. In some cases, one might
be willing to trade some speed for a structured format such as Avro; it
might also be the case that the destination format (the TO) expects the
data to be written in Avro.




Best,
*./Vee*

On Mon, Dec 1, 2014 at 3:31 PM, Gwen Shapira <gshapira@cloudera.com> wrote:

> Performance numbers would be sweet at some point for sure.
> Based on some rough tests we did in the field (on another project),
> Avro serialization does have significant overhead (I think Hive
> writing CSV was 20% faster than to Avro, I can dig up my results
> later). It may be even worse for Sqoop since Hive does serialization
> in batches.
>
> This is not completely scientific, but leads me to believe that as
> much as I love Avro, we'll need a good reason to use it internally.
>
> On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj <vbasavaraj@cloudera.com>
> wrote:
> > Jarcec,
> >
> > If we were more metrics-driven, with some tests and/or benchmarks to prove
> > how much faster this would be, it would have been great. Just a suggestion.
> >
> > Gwen probably meant the same as well.
> >
> >
> >
> >
> >
> >
> > Best,
> > *./Vee*
> >
> > On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <jarcec@apache.org>
> > wrote:
> >
> >> Gwen,
> >> we’ve investigated mysqldump, pg_dump and a few others already; the results
> >> are on the wiki [1]. The resulting CSV-ish specification follows those
> >> two very closely.
> >>
> >> In the MySQL case specifically, I’ve looked into mysqldump output rather than
> >> the “LOAD DATA”/“SELECT INTO OUTFILE” statements, because “LOAD DATA” requires
> >> the file to exist on the database machine, whereas mysqldump/mysqlimport
> >> allows us to import data to the database from any machine on the Hadoop
> >> cluster.
> >>
> >> Jarcec
> >>
> >> Links:
> >> 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
> >>
> >> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira <gshapira@cloudera.com> wrote:
> >> >
> >> > Agreed. I hope we'll have at least one direct connector real soon now
> >> > to prove it.
> >> >
> >> > Reading this:
> >> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> >> > was a bit discouraging...
> >> >
> >> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <abe@cloudera.com>
> >> > wrote:
> >> >> My understanding is that MySQL and PostgreSQL can output to CSV in the
> >> >> suggested format.
> >> >>
> >> >> NOTE: getTextData() and setTextData() APIs are effectively useless if
> >> >> reduced processing load is not possible.
> >> >>
> >> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <gshapira@cloudera.com>
> >> >> wrote:
> >> >>
> >> >>> (hijacking the thread a bit for a related point)
> >> >>>
> >> >>> I have some misgivings around how we manage the IDF now.
> >> >>>
> >> >>> We go with a pretty specific CSV in order to avoid extra processing
> >> >>> for MySQL/Postgres direct connectors.
> >> >>> I think the intent is to allow running LOAD DATA without any processing.
> >> >>> Therefore we need to research and document the specific formats
> >> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
> >> >>> often funky) formatting they need (If escaping is not used then NULL
> >> >>> is null, otherwise \N...)
> >> >>>
> >> >>> If zero-processing load is not feasible, I'd reconsider the IDF and
> >> >>> lean toward a more structured format (Avro?). If the connectors need
> >> >>> to parse the CSV and modify it, we are not gaining anything here. Or
> >> >>> at the very least benchmark to validate that CSV+processing is still
> >> >>> the fastest / least CPU option.
> >> >>>
> >> >>> Gwen
> >> >>>
> >> >>>
> >> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <abe@cloudera.com>
> >> >>> wrote:
> >> >>>> Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> >> >>>> re-define it...
> >> >>>>
> >> >>>> Also, for #2... There are a few ways of generating output. It seems
> >> >>>> NULL values range from "\N" to 0x0 to "NULL". I think keeping NULL
> >> >>>> makes sense.
> >> >>>>
> >> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <jarcec@apache.org>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> I do share the same point of view as Gwen. The CSV format for the IDF
> >> >>>>> is very strict so that we have minimal surface area for
> >> >>>>> inconsistencies between multiple connectors. This is because the IDF
> >> >>>>> is an agreed-upon exchange format when transferring data from one
> >> >>>>> connector to the other. That however shouldn't stop one connector
> >> >>>>> (such as HDFS) from offering ways to save the resulting CSV
> >> >>>>> differently.
> >> >>>>>
> >> >>>>> We had a similar discussion about separator and quote characters in
> >> >>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
> >> >>>>>
> >> >>>>> Jarcec
> >> >>>>>
> >> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <gshapira@cloudera.com>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>> I think it's two different things:
> >> >>>>>>
> >> >>>>>> 1. HDFS connector should give more control over the formatting of
> >> >>>>>> the data in text files (nulls, escaping, etc)
> >> >>>>>> 2. IDF should give NULLs in a format that is optimized for
> >> >>>>>> MySQL/Postgres direct connectors (since that's one of the IDF
> >> >>>>>> design goals).
> >> >>>>>>
> >> >>>>>> Gwen
> >> >>>>>>
> >> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <abe@cloudera.com>
> >> >>>>>> wrote:
> >> >>>>>>> Hey guys,
> >> >>>>>>>
> >> >>>>>>> Any thoughts on where configurable NULL values should be? Either the
> >> >>>>>>> IDF or HDFS connector?
> >> >>>>>>>
> >> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >> >>>>>>>
> >> >>>>>>> -Abe
> >> >>>>>
> >> >>>>>
> >> >>>
> >>
> >>
>
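
For readers following the thread, here is a minimal sketch of the split
discussed above: the IDF keeps one fixed NULL token, and a connector (such
as HDFS) rewrites it to a configured value only when writing its own
output. Everything here is hypothetical and for illustration only: the
names IDF_NULL, encode_row, split_fields and rewrite_nulls are not Sqoop
APIs, and the single-quoted strings and unquoted NULL token are
assumptions about the IDF.

```python
# Hypothetical sketch, not Sqoop code. Assumes the IDF writes strings in
# single quotes and NULL as an unquoted token.
IDF_NULL = "NULL"


def encode_field(value):
    """Encode one Python value as an IDF-style CSV field."""
    if value is None:
        return IDF_NULL
    if isinstance(value, str):
        # Escape backslashes first, then the quote character.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def encode_row(values):
    """Join encoded fields with commas."""
    return ",".join(encode_field(v) for v in values)


def split_fields(line):
    """Split a line on commas that are outside single-quoted strings."""
    fields, current, in_quotes, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and in_quotes and i + 1 < len(line):
            # Keep the escape sequence intact and skip the escaped character.
            current.append(line[i:i + 2])
            i += 2
            continue
        if ch == "'":
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def rewrite_nulls(line, null_value):
    """Connector-side step: swap the IDF NULL token for a configured one."""
    return ",".join(null_value if f == IDF_NULL else f
                    for f in split_fields(line))


row = encode_row([1, None, "NULL", "a,b"])
print(row)                         # 1,NULL,'NULL','a,b'
print(rewrite_nulls(row, "\\N"))   # 1,\N,'NULL','a,b'
```

Note that the quoted string 'NULL' is left alone while the unquoted token
is rewritten. The rewrite step has to parse every line, which is exactly
the extra per-record processing cost that Gwen's message asks to
benchmark.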
