sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From pratik khadloya <tispra...@gmail.com>
Subject Re: Sqoop Incremental mode to get updated data in Hive
Date Fri, 19 Sep 2014 15:18:23 GMT
What other columns do you have in your table which can be used?
Either a last modified timestamp based column or a version id based column
is needed for sqoop to know if something new came along
since the last import. Even if we were to write our own version of sqoop,
we would still need that information.

Another option could have been to monitor the query logs of the RDBMS and
batch the logs and then issue imports. But sqoop does not monitor the query
logs.

~Pratik

On Fri, Sep 19, 2014 at 6:41 AM, Radhe Radhe <radhe.krishna.radhe@live.com>
wrote:

> Hello Experts,
>
> I need to extract data from RDBMS tables to Hive Tables on a timely
> basis(daily/weekly, etc.).
>
> My tables have a '*Primary Key*' but does NOT have a '*last-modified*'
> column.
>
> I plan to go as:
> *FIRST RUN:* Will use sqoop-import-all-tables command for importing all
> the tables at one go to Hive Tables.
> *EACH SUBSEQUENT RUN:* Use Sqoop incremental import mode to retrieve only
> rows newer than some previously-imported set of rows.
>
> My question is how can I get the updated rows which got updated in between
> the *FIRST RUN *and *NEXT SUBSEQUENT RUN*.
> For e.g.
> Say in *FIRST RUN: *I fetched all tables.
> A CUSTOMER table that has 100 records with CustomerId as Primary Key(1 to
> 100) is imported to Hive CUSTOMER table.
>
> And now meanwhile some rows in the Source CUSTOMER table got updated.
> With my *NEXT SUBSEQUENT RUN *this will fetch rows > 100, thus skipping
> the updated rows which are  < 100.
>
> How can I get the updated rows on each subsequent run.
>
> Referring to* Sqoop documentation* this strategy works only if we have a
> last-modified column(*which in my case don't have*):
> An alternate table update strategy supported by Sqoop is called
> lastmodified mode.
> You should use this when rows of the source table may be updated, and each
> such update will set the value of a last-modified column to the current
> timestamp.
> Rows where the check column holds a timestamp more recent than the
> timestamp specified with --last-value are imported.
>
> Any suggestions on how to get the updated data on SUBSEQUENT RUNS using
> the Sqoop Incremental mode?
>
> Thanks,
> -RR
>

Mime
View raw message