sqoop-user mailing list archives

From Jason Rosenberg <...@squareup.com>
Subject Re: direct mode for mysql export vs. default jdbc
Date Wed, 26 Mar 2014 02:38:21 GMT
Thanks Kate,

Very helpful.  Is there any chance that the NULL substitution handling
will be supported in direct mode in the future?  Or is it a fundamental
issue with Hive itself?

I'll try reducing the number of mappers further....

Jason


On Tue, Mar 25, 2014 at 8:26 PM, Kathleen Ting <kathleen@apache.org> wrote:

> Hi Jason,
>
> Sorry for the delay.
>
> Generally speaking, you'll want to decrease num-mappers (default is 4)
> to lessen the load on the db (but ingest rate will decline) and you'll
> want to increase num-mappers to improve the ingest rate (but db load
> will be negatively impacted). Beyond that, we can't give any
> recommendation with regard to the number of mappers, as every
> environment is different. The main bottleneck in a Sqoop job is the
> shared database system. Sqoop can scale only to the extent that is
> allowed by the particular database. Furthermore, this can differ from
> table to table even within a single database system depending on what
> disks the particular import table uses.
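>
> For example (the connection string and names here are just
> placeholders), the knob is simply -m/--num-mappers:
>
>   sqoop export --connect jdbc:mysql://db.example.com/mydb \
>     --table my_table \
>     --export-dir /user/hive/warehouse/mydb.db/my_table \
>     --num-mappers 2   # lower = lighter db load, slower ingest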
>
> The MySQL direct connector uses a native utility called mysqldump to
> perform a highly efficient data transfer between the MySQL server and
> Hadoop cluster. This utility unfortunately does not support using
> custom NULL substitution strings and will always import missing values
> as a string constant NULL. This is very confusing on the Hive side, as
> the Hive shell will display the value as NULL as well. It won't be
> perceived as a missing value, but as a valid string constant. You need
> to turn off direct mode (by omitting the --direct option) in order to
> override the default NULL substitution string.
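>
> For example (names are placeholders), a non-direct import that keeps
> Hive's \N convention for missing values would look roughly like:
>
>   sqoop import \
>     --connect jdbc:mysql://db.example.com/mydb \
>     --table my_table \
>     --hive-import \
>     --null-string '\\N' \
>     --null-non-string '\\N'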
>
> Regards,
> Kate
>
> On Tue, Mar 25, 2014 at 8:26 AM, Jason Rosenberg <jbr@squareup.com> wrote:
> > Ping!
> >
> >
> > On Mon, Mar 24, 2014 at 11:50 AM, Jason Rosenberg <jbr@squareup.com>
> wrote:
> >>
> >> Hi Kathleen, Et al,
> >>
> >> Any thoughts on my parameters above?  Would this likely work for
> >> --direct mode?
> >>
> >> Thanks,
> >>
> >> Jason
> >>
> >>
> >> On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <jbr@squareup.com>
> wrote:
> >>>
> >>> Hi Kathleen,
> >>>
> >>> The sqoop command looks like this (a bash function that takes a few
> >>> args naming the source/dest table names, update key, and dest columns):
> >>>
> >>> sqoop_table () {
> >>>   sqoop export \
> >>>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
> >>>     --username $MYSQL_USER \
> >>>     --outdir $LOCAL_TEMP \
> >>>     --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
> >>>     --table $2 \
> >>>     --update-key $3 \
> >>>     --update-mode allowinsert \
> >>>     --columns $4 \
> >>>     --input-fields-terminated-by "\t" \
> >>>     --input-null-string '\\N' \
> >>>     --input-null-non-string '\\N' \
> >>>     --batch \
> >>>     --num-mappers 8
> >>> }
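> >>>
> >>> For reference, it gets invoked roughly like the following (made-up
> >>> names standing in for the real table/key/column arguments):
> >>>
> >>>   sqoop_table my_hive_table my_mysql_table id "col_a,col_b,col_c"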
> >>>
> >>> The lock contention exceptions seem to be retried, up to a point.  If
> >>> they happen too often, the Hadoop JobTracker eventually decides to kill
> >>> the job.  Even when the job succeeds, there are usually a few of these
> >>> exceptions.  I originally had many more mappers configured, but here
> >>> I've reduced it to 8, and this seems to help (although it's not a
> >>> guarantee).  Reducing num-mappers also makes it less likely that the
> >>> target mysql db machine will get overloaded (e.g. it was getting maxed
> >>> out on cpu usage with 64 mappers, etc.).
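> >>>
> >>> Something else I may experiment with, if I'm reading the export docs
> >>> correctly, is shrinking how many rows each mapper commits per
> >>> transaction (so row locks are held for shorter stretches), e.g.:
> >>>
> >>>   sqoop export \
> >>>     -Dsqoop.export.records.per.statement=100 \
> >>>     -Dsqoop.export.statements.per.transaction=10 \
> >>>     ... (rest of the arguments as above)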
> >>>
> >>> The number of records being sqooped is on the order of 1-5M at a time.
> >>>
> >>> Sounds like I should try --direct mode (but not sure if the
> >>> null/delimiters we're using will work with it?).
> >>>
> >>> Jason
> >>>
> >>>
> >>>
> >>> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <kathleen@apache.org>
> >>> wrote:
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>> Rather than using the JDBC interface for transferring data, the direct
> >>>> mode delegates the job of transferring data to the native utilities
> >>>> provided by the database vendor. In the case of MySQL, the mysqldump
> >>>> and mysqlimport utilities are used for retrieving data from the
> >>>> database server or moving data back into it. Using native utilities
> >>>> will greatly
> >>>> improve performance, as they are optimized to provide the best
> >>>> possible transfer speed while putting less burden on the database
> >>>> server. That said, there are several limitations that come with this
> >>>> faster import. In the case of MySQL, each node hosting a TaskTracker
> >>>> service needs to have both mysqldump and mysqlimport utilities
> >>>> installed. Another limitation of the direct mode is that not all
> >>>> parameters are supported. As the native utilities usually produce text
> >>>> output, binary formats like SequenceFile or Avro won't work. Also,
> >>>> parameters that customize the escape characters, type mapping, column
> >>>> and row delimiters, or the NULL substitution string might not be
> >>>> supported in all cases.
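> >>>>
> >>>> As a quick sanity check (just a sketch), you can confirm on each
> >>>> TaskTracker node that both utilities are on the PATH:
> >>>>
> >>>>   which mysqldump mysqlimport || echo "MySQL client utilities missing"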
> >>>>
> >>>> Can you share your entire Sqoop command and the contents of the
> >>>> failed task attempt attempt_201403180842_0202_m_000002_1?
> >>>>
> >>>> Thanks,
> >>>> Kate
> >>>>
> >>>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <jbr@squareup.com>
> >>>> wrote:
> >>>> > Thoughts anyone?
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Jason
> >>>> >
> >>>> >
> >>>> > On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <jbr@squareup.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> I'm wondering whether there is an expected performance increase
> >>>> >> from using the --direct flag for exporting from hive to mysql.  If
> >>>> >> so, how much speedup?
> >>>> >>
> >>>> >> Also, I've been getting lock contention errors during export, and
> >>>> >> I'm wondering if these are less likely using --direct mode?  E.g.
> >>>> >> I'm getting these sorts of exceptions on the sqoop console:
> >>>> >>
> >>>> >> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id :
> >>>> >> attempt_201403180842_0202_m_000002_1, Status : FAILED
> >>>> >> java.io.IOException: Can't export data, please check failed map task logs
> >>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
> >>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
> >>>> >>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
> >>>> >>         at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
> >>>> >>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >>>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >>>> >>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >>>> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>>> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
> >>>> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> >>>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
> >>>> >> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock
> >>>> >> found when trying to get lock; try restarting transaction
> >>>> >>         at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
> >>>> >>
> >>>> >>
> >>>> >> Thanks,
> >>>> >>
> >>>> >> Jason
> >>>> >
> >>>> >
> >>>
> >>>
> >>
> >
>
