sqoop-user mailing list archives

From yogesh kumar <yogeshsq...@gmail.com>
Date Sat, 11 Jan 2014 17:31:22 GMT
Hello All,

I am working on a use case where I have to run a daily process that does
the following:

1) Pull the new data inserted into the RDBMS tables each day into HDFS.
2) Keep an external table in Hive (pointing to the HDFS directory where
the data is pulled by Sqoop).
3) Run some Hive queries (joins) and create a final internal table in
Hive (say, Hive_Table_Final).

What I am doing:

I am migrating a process from an RDBMS to Hadoop. The same process
currently runs as an RDBMS procedure and stores its result in a final
table (say, Rdbms_Table_Final).

The issue I am facing:

After every incremental import and the processing that follows, the values
in the final Hive table (Hive_Table_Final) come out multiplied by the
number of incremental imports I have run so far. For example, if I run one
incremental import per day for 4 days, the values in all columns of
Hive_Table_Final are 4 times those in the final RDBMS table
(Rdbms_Table_Final).


1) The first time, I pulled the data from the RDBMS by month (like
from 2013-12-01 to 2013-01-01) and processed it. The results were correct:
the data in the final Hive table (Hive_Table_Final) matched the processed
data in the RDBMS (Rdbms_Table_Final).

2) I then ran an incremental import to bring new data from the RDBMS into
HDFS, using this command:

 sqoop import -libjars
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
 --check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD}

Note: the field Unique_value is always unique; it is like a primary key.
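
For what it is worth, one thing that could produce this kind of
duplication (an assumption, not something confirmed for this job) is
re-running the import with the same --last-value every day. Sqoop's append
mode imports rows whose check column is greater than --last-value, so a
stale value pulls the earlier rows again. The toy Python model below uses
made-up rows and hypothetical helper names (incremental_append,
run_daily_imports); it does not call Sqoop.

```python
# Toy model of `--incremental append`. All rows and helper names are
# hypothetical; this only mimics the "check column > last value" filter.

def incremental_append(source_rows, last_value):
    """Return the rows whose Unique_value is strictly greater than last_value."""
    return [row for row in source_rows if row["Unique_value"] > last_value]


def run_daily_imports(days, advance_last_value):
    """Simulate one import per day; return everything written to the target."""
    source = []      # stands in for the RDBMS table
    target = []      # stands in for the files under the HDFS target directory
    last_value = 0
    next_id = 1
    for _ in range(days):
        # Two new rows arrive in the source table each day.
        for _ in range(2):
            source.append({"Unique_value": next_id, "SAL": 100})
            next_id += 1
        imported = incremental_append(source, last_value)
        target.extend(imported)
        # Only move the watermark forward when asked to.
        if advance_last_value and imported:
            last_value = max(row["Unique_value"] for row in imported)
    return target


stale = run_daily_imports(4, advance_last_value=False)
fresh = run_daily_imports(4, advance_last_value=True)
print(len(stale), len(fresh))
```

With a stale last value, the first day's rows end up in the target four
times after four runs; with the value advanced after each run, every row
lands exactly once.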

So far I have only pulled the new records from the RDBMS into HDFS.

Now I have a major data mismatch issue, after the

My major issue is with the Sqoop incremental import: however many times I
run it, the data in my final table ends up multiplied by that number.

Please suggest what I am doing wrong or what I am missing.
Please help me out.

Thanks & Regards
Yogesh Kumar
