sqoop-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Incremental lastmodified with external tables
Date Thu, 06 Mar 2014 16:53:49 GMT
Hello all,

I am working on setting up incremental imports on a table that will update
on a Last Modified Date column as the check column.

In my testing, I have created an external table, and imported sample tables
with the job created thus:

sqoop job --create import_testing_users -- import --connect \
  --table Users --warehouse-dir /user/blah/Reports/testing --append \
  --fields-terminated-by '\001' --incremental lastmodified --check-column

And this seems to be picking up the updates properly.

However, when I update the timestamps in the check column and re-run the job,
I end up with new rows containing the updated data alongside the old rows that
still carry the old data, so my SELECT statements pick up multiple rows per
key. I'm sure I could select rows where last_modified = max(last_modified),
but that's not ideal.

I've found this mentioned as a known problem, and the only solution I have
found is to import the new values into a side directory and run sqoop merge
on the data to flatten it.
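For what it's worth, a merge pass along these lines would flatten the side directory onto the main dataset, keeping only the newest copy of each row. This is only a sketch: the merge key (id), the class/jar names, and the directory paths are assumptions and would need to match your import.

```shell
# Hedged sketch -- merge key, jar/class names, and directories are
# assumptions. --jar-file/--class-name refer to the record class Sqoop
# generated during the original import.
sqoop merge \
  --new-data /user/blah/Reports/testing_increment/Users \
  --onto /user/blah/Reports/testing/Users \
  --target-dir /user/blah/Reports/testing_merged/Users \
  --jar-file Users.jar \
  --class-name Users \
  --merge-key id
```

sqoop merge keeps, for each value of --merge-key, the record from --new-data when one exists, otherwise the record from --onto, and writes the result to --target-dir.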

I'd like to automate this in a shell script, and was wondering if there is
some better way than to run a merge operation in the script.
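In case it helps frame the question, the automation I have in mind would look roughly like this. Everything here is hypothetical except the job name: the directories, the merge key, and the final directory swap are assumptions about my layout.

```shell
#!/bin/sh
# Hedged sketch: run the incremental job into a side directory, merge
# it onto the main dataset, then swap directories so the external table
# keeps pointing at the same HDFS location. Paths and the merge key
# (id) are assumptions.
set -e

BASE=/user/blah/Reports/testing/Users
NEW=/user/blah/Reports/testing_increment/Users
MERGED=/user/blah/Reports/testing_merged/Users

# 1. Pull the rows whose last-modified timestamp advanced.
sqoop job --exec import_testing_users

# 2. Flatten: newest copy of each row wins.
sqoop merge \
  --new-data "$NEW" \
  --onto "$BASE" \
  --target-dir "$MERGED" \
  --jar-file Users.jar \
  --class-name Users \
  --merge-key id

# 3. Replace the old data with the merged result and clean up.
hadoop fs -rm -r "$BASE"
hadoop fs -mv "$MERGED" "$BASE"
hadoop fs -rm -r "$NEW"
```

It works, but step 3 is the part that feels clumsy, hence the question about whether the merge is really necessary.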

Is the merge still the best solution for this situation?

Also, does importing into an external table have an impact on this operation?

Incidentally, I found it necessary to add the "--warehouse-dir
/path/to/dir --append" arguments on this job, whereas the jobs I set up
for a MySQL import (they are "--incremental append" jobs) did not need them.
I was just wondering if that was a driver difference or a job configuration
difference.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
