sqoop-dev mailing list archives

From "Tom Harrison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-2056) Support for Mysql Sqoop Metastore
Date Fri, 11 Aug 2017 20:47:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124015#comment-16124015 ]

Tom Harrison commented on SQOOP-2056:

A note for others tracking down this issue.

As of 2017, Cloudera still recommends Sqoop 1, and there are several easily found documents
on using an external database other than HSQLDB for the metastore. All of them miss the key
observations made by the original reporter, which I'll expand on here (mostly for others who
may be tracking down a similar problem).

Sqoop 1 fails when concurrent updates to the metastore conflict. It correctly returns a
non-zero return code and logs the error. However, a failure during an incremental import
*can lead to Hadoop data corruption*: if the stored "next value" is not updated, the
_subsequent run_ of Sqoop re-imports the same records from the database, duplicating data.
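
To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use Sqoop
at all: the function name, the `save_ok` flag, and the integer rows are hypothetical stand-ins for
an incremental append job, its metastore update, and the source table's check column.

```python
def incremental_append(source_rows, target, last_value, save_ok=True):
    """Append rows greater than last_value to target; return the stored value.

    save_ok=False simulates the metastore update failing after the data
    has already been written to the target.
    """
    new_rows = [r for r in source_rows if r > last_value]
    target.extend(new_rows)  # the data lands regardless of the metastore
    if not save_ok or not new_rows:
        return last_value  # stored value is left stale
    return max(new_rows)


source = [1, 2, 3]
target = []
last_value = 0

# Run 1: rows are imported, but the metastore update fails.
last_value = incremental_append(source, target, last_value, save_ok=False)

# Run 2: new rows have arrived; the stale value makes rows 1-3 import again.
source += [4, 5]
last_value = incremental_append(source, target, last_value)

duplicates = sorted(r for r in set(target) if target.count(r) > 1)
print(target)
print(duplicates)
```

The second run succeeds and stores a correct value, so nothing in that run's own result
points to the duplicates; only the non-zero return code of the first run does.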

Our case was Sqoop 1.4.6 on EMR Hadoop, running an incremental append import with the Sqoop
metastore on PostgreSQL (the same issue as reported here with MySQL). Concurrent updates from
Sqoop jobs running in parallel sporadically resulted in an IOException from the metastore DB.

We'll try to put together a small patch if we get a chance.

> Support for Mysql Sqoop Metastore
> ---------------------------------
>                 Key: SQOOP-2056
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2056
>             Project: Sqoop
>          Issue Type: New Feature
>    Affects Versions: 1.4.5
>            Reporter: Karthic Hariharan
>         Attachments: sqoop-patch.txt
> We would love to see sqoop metastore supported for Mysql.
> At the moment sqoop metastore can be set up only with HSQLdb. Even though you can fake
> a mysql database to look like a HSQLdb (refer http://bit.ly/1tz2J5u), it does not translate
> to compatibility with all of sqoop's features.
> Some of the incompatibilities are:
> * The metastore client assumes all connections to the metastore are in serializable transaction
> isolation, so when a sqoop job is executed it never really finishes because it's trying to run
> a transaction within a transaction.
> * Incremental loads using the last modified timestamp don't work because the sqoop job tries
> to get the current time on the database, which is a different SQL command for HSQLdb and mysql.
