sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Radhe Radhe <radhe.krishna.ra...@live.com>
Subject RE: Sqoop Incremental mode to get updated data in Hive
Date Fri, 19 Sep 2014 17:22:47 GMT
Hi Pratik,
Saying that means there is no straight way to get the updated rows in an increment mode where
the table does not have a 'last-modified' column.
The ONLY way the increment mode will work is with the 'append' mode in case the table does
NOT have 'last-modified' column.
You should specify append mode when importing a table where new rows are continually being
added with increasing row id values. You specify the column containing the row's id with --check-column.
Sqoop imports rows where the check column has a value greater than the one specified with
Also Hive does not support the notion of Updating/Deleting rows.https://issues.apache.org/jira/browse/HIVE-5317
The append mode will not help in my case as I want the updated rows in the Target Hive tables
via the Sqoop import increment mode.
I believe that there has to be some work around for this as this is one of the most basic
scenarios for an increment import from Source to Target.

From: tispratik@gmail.com
Date: Fri, 19 Sep 2014 08:51:04 -0700
Subject: Re: Sqoop Incremental mode to get updated data in Hive
To: user@sqoop.apache.org

The docs say that "The merge tool is typically run after an incremental import with the date-last-modified
mode (sqoop import --incremental lastmodified …)."
Also, I am not sure how the merge tool handles deleted records, looking at the code of merge
tool, i don't think it deletes the corresponding records from target which were deleted in
the source.
On Fri, Sep 19, 2014 at 8:43 AM, Radhe Radhe <radhe.krishna.radhe@live.com> wrote:

Hi Pratik,
The source RDBMs tables will not always necessarily contain a 'last-modified' column.
I see that 'sqoop-merge' is there:The merge tool allows you to combine two datasets where
entries in one dataset should overwrite entries of an older dataset. For example, an incremental
import run in last-modified mode will generate multiple datasets in HDFS where successively
newer data appears in each dataset. The mergetool will "flatten" two datasets into one, taking
the newest available records for each primary key.

Can 'sqoop-merge' be used to get the updated rows somehow? Will that be a viable solution?
From: tispratik@gmail.com
Date: Fri, 19 Sep 2014 08:18:23 -0700
Subject: Re: Sqoop Incremental mode to get updated data in Hive
To: user@sqoop.apache.org

What other columns do you have in your table which can be used?Either a last modified timestamp
based column or a version id based column is needed for sqoop to know if something new came
alongsince the last import. Even if we were to write our own version of sqoop, we would still
need that information.
Another option could have been to monitor the query logs of the RDBMS and batch the logs and
then issue imports. But sqoop does not monitor the query logs.
On Fri, Sep 19, 2014 at 6:41 AM, Radhe Radhe <radhe.krishna.radhe@live.com> wrote:

Hello Experts,
I need to extract data from RDBMS tables to Hive Tables on a timely basis(daily/weekly, etc.).
My tables have a 'Primary Key' but does NOT have a 'last-modified' column.
I plan to go as:FIRST RUN: Will use sqoop-import-all-tables command for importing all the
tables at one go to Hive Tables.EACH SUBSEQUENT RUN: Use Sqoop incremental import mode to
retrieve only rows newer than some previously-imported set of rows.
My question is how can I get the updated rows which got updated in between the FIRST RUN and
NEXT SUBSEQUENT RUN.For e.g. Say in FIRST RUN: I fetched all tables. A CUSTOMER table that
has 100 records with CustomerId as Primary Key(1 to 100) is imported to Hive CUSTOMER table.
And now meanwhile some rows in the Source CUSTOMER table got updated.With my NEXT SUBSEQUENT
RUN this will fetch rows > 100, thus skipping the updated rows which are  < 100. 
How can I get the updated rows on each subsequent run.
Referring to Sqoop documentation this strategy works only if we have a last-modified column(which
in my case don't have):An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and each such update
will set the value of a last-modified column to the current timestamp. Rows where the check
column holds a timestamp more recent than the timestamp specified with --last-value are imported.
Any suggestions on how to get the updated data on SUBSEQUENT RUNS using the Sqoop Incremental


View raw message