sqoop-user mailing list archives

From Jason Rosenberg <...@squareup.com>
Subject Re: direct mode for mysql export vs. default jdbc
Date Tue, 25 Mar 2014 15:26:35 GMT
Ping!


On Mon, Mar 24, 2014 at 11:50 AM, Jason Rosenberg <jbr@squareup.com> wrote:

> Hi Kathleen, Et al,
>
> Any thoughts on my parameters above?  Is this likely to work with --direct
> mode?
>
> Thanks,
>
> Jason
>
>
> On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <jbr@squareup.com> wrote:
>
>> Hi Kathleen,
>>
>> The sqoop command looks like this (a bash command, that takes a few args
>> naming the source/dest table names, update key, and dest columns).
>>
>> sqoop_table () {
>>   # $1 = hive table (export dir), $2 = mysql table, $3 = update key, $4 = column list
>>   sqoop export \
>>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>>     --username $MYSQL_USER \
>>     --outdir $LOCAL_TEMP \
>>     --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
>>     --table $2 \
>>     --update-key $3 \
>>     --update-mode allowinsert \
>>     --columns $4 \
>>     --input-fields-terminated-by "\t" \
>>     --input-null-string '\\N' \
>>     --input-null-non-string '\\N' \
>>     --batch \
>>     --num-mappers 8
>> }
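>>
>> It gets invoked roughly like this (the table, key, and column names below
>> are just placeholders, not our real schema):
>>
>>   sqoop_table orders orders_export id "id,customer_id,total,updated_at"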
>>
>> The lock contention exceptions seem to be retried, up to a point.  If
>> they happen too often, the hadoop job tracker eventually decides to kill
>> the job.  Even when the job succeeds, there are usually a few of these
>> exceptions.  I originally had many more mappers configured, but here I've
>> reduced it to 8, and this seems to help (although it's not a guarantee).
>> Reducing the num-mappers also makes it less likely that the target mysql
>> db machine will get overloaded (e.g. it was getting maxed out on CPU usage
>> with 64 mappers).
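>>
>> I've also been wondering whether the export batching knobs would reduce the
>> contention; if I'm reading the docs right, something like the following
>> (untested, and I'm not sure how it interacts with --batch) would make each
>> transaction smaller:
>>
>>   sqoop export \
>>     -Dsqoop.export.records.per.statement=100 \
>>     -Dsqoop.export.statements.per.transaction=1 \
>>     ... (rest of the options as in the function above)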
>>
>> The number of records being sqooped is on the order of 1-5M at a time.
>>
>> Sounds like I should try --direct mode (but I'm not sure whether the null
>> substitution strings and delimiters we're using will work with it?).
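>>
>> If I do try it, I'm assuming the only change to the command above is adding
>> the flag, roughly like this (untested; also not sure whether --update-key /
>> --update-mode are even honored in direct mode):
>>
>>   sqoop export --direct \
>>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>>     ... (remaining options as above)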
>>
>> Jason
>>
>>
>>
>> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <kathleen@apache.org> wrote:
>>
>>> Hi Jason,
>>>
>>> Rather than using the JDBC interface for transferring data, the direct
>>> mode delegates the job of transferring data to the native utilities
>>> provided by the database vendor. In the case of MySQL, the mysqldump
>>> and mysqlimport utilities will be used for retrieving data from the database
>>> server or moving data back. Using native utilities will greatly
>>> improve performance, as they are optimized to provide the best
>>> possible transfer speed while putting less burden on the database
>>> server. That said, there are several limitations that come with this
>>> faster transfer. In the case of MySQL, each node hosting a TaskTracker
>>> service needs to have both mysqldump and mysqlimport utilities
>>> installed. Another limitation of the direct mode is that not all
>>> parameters are supported. As the native utilities usually produce text
>>> output, binary formats like SequenceFile or Avro won't work. Also,
>>> parameters that customize the escape characters, type mapping, column
>>> and row delimiters, or the NULL substitution string might not be
>>> supported in all cases.
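>>>
>>> (A quick sanity check on each node is to confirm that both utilities are
>>> on the PATH of the user running the TaskTracker, e.g.:
>>>
>>>     command -v mysqldump mysqlimport
>>>
>>> should print a path for each.)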
>>>
>>> Can you share your entire Sqoop command and the contents of failed
>>> task attempt attempt_201403180842_0202_m_000002_1?
>>>
>>> Thanks,
>>> Kate
>>>
>>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <jbr@squareup.com> wrote:
>>> > Thoughts anyone?
>>> >
>>> > Thanks,
>>> >
>>> > Jason
>>> >
>>> >
>>> > On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <jbr@squareup.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I'm wondering if there are expected performance increases from using the
>>> >> --direct flag for exporting from hive to mysql.  If so, how much speedup?
>>> >>
>>> >> Also, I've been getting lock contention errors during export, and I'm
>>> >> wondering if these are less likely using --direct mode?  E.g. I'm getting
>>> >> these sorts of exceptions on the sqoop console:
>>> >>
>>> >> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id : attempt_201403180842_0202_m_000002_1, Status : FAILED
>>> >> java.io.IOException: Can't export data, please check failed map task logs
>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
>>> >>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>>> >>         at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
>>> >>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>> >>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>> >>         at java.security.AccessController.doPrivileged(Native Method)
>>> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> >> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
>>> >>         at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
>>> >>
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jason
>>> >
>>> >
>>>
>>
>>
>
