sqoop-user mailing list archives

From Venkat Ranganathan <vranganat...@hortonworks.com>
Subject Re: Sqoop Incremental Import with lastmodified mode giving Duplicate rows for updated rows
Date Tue, 07 Oct 2014 06:55:15 GMT
Those can be ignored (it looks like those are environment variables that
are set elsewhere to control the import). In any case, I have followed up on
the Hortonworks forum.

Venkat

On Mon, Oct 6, 2014 at 9:53 PM, Nirmal Kumar <nirmal.kumar@impetus.co.in>
wrote:

>  Thanks Venkat,
>
>  I have posted this in the Hortonworks Forums as well:
>
> http://hortonworks.com/community/forums/topic/sqoop-incremental-import-lastmodified-giving-duplicate-rows-for-updated-rows/
>
>  I tried the HDP documentation for incremental import,
> but it gives very little information.
>
>  The following is all that *bk_HortonworksConnectorForTeradata.pdf* says
> about incremental import.
>
>  *Incremental Import*
> Teradata incremental import emulates the check-column and last value
> options. Here is an
> example for a table which has 'hire_date' as the date column to check
> against and 'name' as
> the column that can be used to partition the data.
>
>  export USER=dbc
> export PASS=dbc
> export HOST=<dbhost>
> export DB=<dbuser>
> export TABLE=<dbtable>
> export JDBCURL=jdbc:teradata://$HOST/DATABASE=$DB
> export IMPORT_DIR=<hdfs-dir to import>
> export VERBOSE=--verbose
> export MANAGER=org.apache.sqoop.teradata.TeradataConnManager
> export CONN_MANAGER="--connection-manager $MANAGER"
> export CONNECT="--connect $JDBCURL"
> MAPPERS="--num-mappers 4"
> DATE="'1990-12-31'"
> FORMAT="'yyyy-mm-dd'"
> LASTDATE="cast($DATE as date format $FORMAT)"
> SQOOPQUERY="select * from employees where hire_date < $LASTDATE AND \$CONDITIONS"
> $SQOOP_HOME/bin/sqoop import $TDQUERY $TDSPLITBY $INPUTMETHOD $VERBOSE \
>   $CONN_MANAGER $CONNECT --query "$SQOOPQUERY" --username $USER --password $PASS \
>   --target-dir $IMPORT_DIR --split-by name
>
>  The values of $TDQUERY, $TDSPLITBY and $INPUTMETHOD are not defined in the
> example, which is confusing.
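
A minimal sketch of why the undefined variables are harmless, assuming they
are simply unset in the shell that runs the example. The helper count_args
is hypothetical and exists only to show how many arguments the command
would receive:

```shell
#!/usr/bin/env bash
# Assumption: TDQUERY, TDSPLITBY and INPUTMETHOD are not set in this shell.
unset TDQUERY TDSPLITBY INPUTMETHOD
VERBOSE=--verbose

# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted, unset variables expand to nothing and are dropped by word
# splitting, so only the value of $VERBOSE is passed on as an argument.
ARG_COUNT=$(count_args $TDQUERY $TDSPLITBY $INPUTMETHOD $VERBOSE)
echo "arguments passed: $ARG_COUNT"
```

If any of the three variables is exported elsewhere in your environment, its
value is passed to sqoop as extra options, so it is worth checking with
`env` before running the example.
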
>
>  Is there more information about incremental imports in HDP?
>
>  Thanks,
> -Nirmal
>
>
>  ------------------------------
> *From:* Venkat Ranganathan <vranganathan@hortonworks.com>
> *Sent:* Monday, October 6, 2014 10:35 PM
> *To:* user@sqoop.apache.org
> *Subject:* Re: Sqoop Incremental Import with lastmodified mode giving
> Duplicate rows for updated rows
>
>  Hi Nirmal
>
>  The HDP connector for Teradata is HDP-specific work, so please use the
> vendor forums for this.
> The connector does not support lastmodified mode as specified in Sqoop. The
> HDP documentation has an example of doing this with queries; please look at
> the documentation.
>
>  Thanks
>
>  Venkat
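
A rough sketch of the query-based approach mentioned above, not taken from
the HDP documentation. It assumes the caller tracks the last imported
timestamp in LAST_VALUE; the table and column names come from the job quoted
further down in this thread, the cast syntax is an assumption that has not
been checked against Teradata, and the target directory is hypothetical.
Connection options are left out for brevity:

```shell
#!/usr/bin/env bash
# Assumption: LAST_VALUE is recorded by the caller after each successful run.
LAST_VALUE='2014-10-03 15:29:48.66'
CHECK_COLUMN=last_modified_col
TABLE=Paper_STAGE

# The backslash keeps $CONDITIONS literal, so Sqoop (not the shell)
# replaces it with the split predicate of each mapper.
QUERY="select * from $TABLE where $CHECK_COLUMN > cast('$LAST_VALUE' as timestamp) AND \$CONDITIONS"

# Print the command instead of running it; remove the echo to execute it.
# /tmp/paper_stage_delta is a hypothetical target directory.
echo sqoop import --query "$QUERY" --split-by id --target-dir /tmp/paper_stage_delta
```

This only fetches the changed rows; it does not merge them with rows that
were imported earlier.
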
>
> On Mon, Oct 6, 2014 at 5:29 AM, Nirmal Kumar <nirmal.kumar@impetus.co.in>
> wrote:
>
>>  Hi All,
>>
>>  I’m trying to do an Incremental Import using Sqoop from Teradata to
>> Hive tables.
>>
>>  I’m using:
>> -Apache Hadoop 2.4.0
>> -Apache Hive 0.13.1
>> -Apache Sqoop 1.4.4
>> -hdp-connector-for-teradata-1.3.2.2.1.5.0-695-distro
>> -Teradata 15.0.0.8
>>
>>  *From Sqoop documentation:*
>> *An alternate table update strategy supported by Sqoop is called
>> lastmodified mode. You should use this when rows of the source table may be
>> updated, and each such update will set the value of a last-modified column
>> to the current timestamp. Rows where the check column holds a timestamp
>> more recent than the timestamp specified with --last-value are imported.*
>>
>>  I followed the below steps:
>>
>>  *STEP 1*: One time activity
>> I do a full import of the source table into a Hive table.
>>
>>  *STEP 2*: One time activity
>> I created a Sqoop job for incremental import:
>> sqoop job --create incr1 -- import \
>>   --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
>>   --connect jdbc:teradata://192.168.199.137/testdb123 \
>>   --username testdb123 --password testdb123 --table Paper_STAGE \
>>   --incremental lastmodified --check-column last_modified_col \
>>   --last-value "2014-10-03 15:29:48.66" --split-by id \
>>   --hive-table paper_stage --hive-import
>>
>>  *STEP 3*: Run periodically from any scheduler or Oozie
>> I execute the Sqoop job for incremental import every time I need the
>> updated or newly added rows.
>> sqoop job --exec incr1
>>
>>  The source table has a unique primary key and a last-modified column
>> that holds the current timestamp.
>> Newly added rows are imported correctly, but for updated rows I get
>> duplicates: Sqoop does not update the existing row but adds a new one
>> with the same id and the new timestamp.
>>
>>  Is this currently unsupported in Sqoop?
>> I ask because I found these:
>>
>>
>> http://stackoverflow.com/questions/19093417/sqoop-import-lastmodified-gives-duplicate-records-it-doesnt-merger
>>
>> http://grokbase.com/p/cloudera/cdh-user/13a4n03jrh/sqoop-import-lastmodified-gives-duplicate-records-merger-does-not-happen
>>
>> https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/xAbXEduvahU
>> https://issues.cloudera.org/browse/DISTRO-464
>>
>>  Is there a way to avoid the duplicate rows and get a single merged row
>> for each updated row in the source table?
>> Please advise on any alternatives to handle this.
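
To make the desired outcome concrete, here is a small sketch of "keep only
the newest row per key", run on hypothetical comma-separated sample rows
(id, last-modified timestamp, title) with plain sort and awk. It illustrates
the merge logic only and is not how Sqoop or Hive implement it; Sqoop's own
merge tool applies the same idea to datasets in HDFS:

```shell
#!/usr/bin/env bash
# Use byte-wise ordering so the sort result does not depend on the locale.
export LC_ALL=C

# Hypothetical sample: id 1 appears twice, as an old and an updated row.
ROWS='1,2014-10-01 10:00:00.00,old title
2,2014-10-02 11:00:00.00,unchanged
1,2014-10-03 15:29:48.66,new title'

# Sort by id, then by timestamp descending, so the newest row of each id
# comes first; awk then keeps only the first row it sees for each id.
MERGED=$(printf '%s\n' "$ROWS" | sort -t, -k1,1 -k2,2r | awk -F, '!seen[$1]++')
echo "$MERGED"
```

The timestamps sort correctly as text only because they share one fixed
format; mixed formats would need to be normalised first.
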
>>
>>  Thanks,
>> -Nirmal
>>
>> ------------------------------
>>
>>
>>
>>
>>
>>
>> NOTE: This message may contain information that is confidential,
>> proprietary, privileged or otherwise protected by law. The message is
>> intended solely for the named addressee. If received in error, please
>> destroy and notify the sender. Any use of this email is prohibited when
>> received in error. Impetus does not represent, warrant and/or guarantee,
>> that the integrity of this communication has been maintained nor that the
>> communication is free of errors, virus, interception or interference.
>>
>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

