sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From pratik khadloya <tispra...@gmail.com>
Subject Re: Sqoop Incremental mode to get updated data in Hive
Date Fri, 19 Sep 2014 15:51:04 GMT
The docs say that "The merge tool is typically run after an incremental
import with the date-last-modified mode (sqoop import --incremental
lastmodified …)."
Also, I am not sure how the merge tool handles deleted records, looking at
the code of merge tool, i don't think it deletes the corresponding records
from target which were deleted in the source.


On Fri, Sep 19, 2014 at 8:43 AM, Radhe Radhe <radhe.krishna.radhe@live.com>

> Hi Pratik,
> The source RDBMs tables will not always necessarily contain a
> 'last-modified' column.
> I see that 'sqoop-merge' is there:
> The merge tool allows you to combine two datasets where entries in one
> dataset should overwrite entries of an older dataset. For example, an
> incremental import run in last-modified mode will generate multiple
> datasets in HDFS where successively newer data appears in each dataset. The
> mergetool will "flatten" two datasets into one, taking the newest
> available records for each primary key.
> Can 'sqoop-merge' be used to get the updated rows somehow? Will that be a
> viable solution?
> Thanks,
> -RR
> ------------------------------
> From: tispratik@gmail.com
> Date: Fri, 19 Sep 2014 08:18:23 -0700
> Subject: Re: Sqoop Incremental mode to get updated data in Hive
> To: user@sqoop.apache.org
> What other columns do you have in your table which can be used?
> Either a last modified timestamp based column or a version id based column
> is needed for sqoop to know if something new came along
> since the last import. Even if we were to write our own version of sqoop,
> we would still need that information.
> Another option could have been to monitor the query logs of the RDBMS and
> batch the logs and then issue imports. But sqoop does not monitor the query
> logs.
> ~Pratik
> On Fri, Sep 19, 2014 at 6:41 AM, Radhe Radhe <radhe.krishna.radhe@live.com
> > wrote:
> Hello Experts,
> I need to extract data from RDBMS tables to Hive Tables on a timely
> basis(daily/weekly, etc.).
> My tables have a '*Primary Key*' but does NOT have a '*last-modified*'
> column.
> I plan to go as:
> *FIRST RUN:* Will use sqoop-import-all-tables command for importing all
> the tables at one go to Hive Tables.
> *EACH SUBSEQUENT RUN:* Use Sqoop incremental import mode to retrieve only
> rows newer than some previously-imported set of rows.
> My question is how can I get the updated rows which got updated in between
> For e.g.
> Say in *FIRST RUN: *I fetched all tables.
> A CUSTOMER table that has 100 records with CustomerId as Primary Key(1 to
> 100) is imported to Hive CUSTOMER table.
> And now meanwhile some rows in the Source CUSTOMER table got updated.
> With my *NEXT SUBSEQUENT RUN *this will fetch rows > 100, thus skipping
> the updated rows which are  < 100.
> How can I get the updated rows on each subsequent run.
> Referring to* Sqoop documentation* this strategy works only if we have a
> last-modified column(*which in my case don't have*):
> An alternate table update strategy supported by Sqoop is called
> lastmodified mode.
> You should use this when rows of the source table may be updated, and each
> such update will set the value of a last-modified column to the current
> timestamp.
> Rows where the check column holds a timestamp more recent than the
> timestamp specified with --last-value are imported.
> Any suggestions on how to get the updated data on SUBSEQUENT RUNS using
> the Sqoop Incremental mode?
> Thanks,
> -RR

View raw message