here was the code I have used ...

sqoop import -libjars
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from
 from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \

--check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD} \
 
 
now I have imported newly inserted data into RDBMS to HDFS
 
but when I do 
 
select count(*) , unique_value from EMP group by unique_value (both in RDBMS and in HIVE)
 
I can find huge data loss.
 
1) in RDBMS
 
  Count(*)    Unique_value
  1000          201401201
   5000         201401202
  10000         201401203
 
 
2) in HIVE
 
  Count(*)    Unique_value 
  189          201401201
   421         201401202
   50           201401203
 
 
If I do
 
select Unique value from emp ;
 
Result : 
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...


On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <yogeshsqoop@gmail.com> wrote:
Hello Abe, 

Thanks a zillion for your response,

Yes the unique_vale is SAL  over here ..

I have an option to test with different directory as its having small amount of data (only 1 month data) 

Yes the interesting fact is I did sqoop pull on the basis on month, into which I am have not done incremental import, and the data matches very well, but for daily pull I have to do incremental import every day..

and then i find some data loss...

one of the sqoop guy has suggested me to use  " \$CONDITIONS " with in this query as  ......  from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \

like 

EMP where SAL > 201401200 and SAL <= 201401204 and   \$CONDITIONS 



Plz do suggest me pls help me out as I have to output to my client..








On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
Yogesh,

Is unique_value in this case SAL? I'm a bit confused about your query.

Do you have the option of running this query on a separate database somewhere to find the issue? I think it would be interesting to see the initial state and then the state after running an incremental import. That would tell us how many results are being imported after sqoop has ran and we can validate each step. Also, please use the --verbose flag to get the most out of the logs.

-Abe


On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <srpunreddy@gmail.com> wrote:

Yogesh,

Please try to put $CONDITIONS after your where clause.

Checkout the examples in the below blog.

http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1

On Jan 13, 2014 7:04 AM, "yogesh kumar" <yogeshsqoop@gmail.com> wrote:
Hello Jarcec,
 
I got the issue hope this is the cause..  I got data loss by doing incremental pull
 
I have crossed checked it and found that
 
sqoop import -libjars
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from
 from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \
--check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD} \
 
 
now I have imported newly inserted data into RDBMS to HDFS
 
but when I do
 
select count(*) , unique_value from EMP group by unique_value (both in RDBMS and in HIVE)
 
I can find huge data loss.
 
1) in RDBMS
 
  Count(*)    Unique_value
  1000          201401201
   5000         201401202
  10000         201401203
 
 
2) in HIVE
 
  Count(*)    Unique_value
  189          201401201
   421         201401202
   50           201401203
 
 
If I do
 
select Unique value from emp ;
 
Result :
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...
 
 
Pls help and suggest why is it so
 
 
Many thanks in advance
 
Yogesh kumar

On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
Hi Yogesh,
I would start by verifying imported data. If there are duplicates than it's suggesting some miss configuration of Sqoop, otherwise you might have some inconsistencies down the pipeline.

Jarcec

On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
> Hello All,
>
> I am working on a use case where I have to run a process on daily basis
> which will do these.
>
> 1)  Pull every day new data inserted into RDBMS tables to HDFS
> 2)  Having external table in hive (pointing to the location of HDFS
> directry where data is pulled by sqoop)
> 3) Perform some hive queries (joins) and create a final internal table into
> Hive (say.. Hive_Table_Final).
>
>
> What I am doing..
>
> I am migrating a process from RDBMS to HADOOP ( same process is being
> executed in RDBMS procedure and stored in final table . {say..
>  Rdbms_Table_Final} )
>
> Issue I am facing is.
>
> Every time I do Incremental import and after processing I find the final
> table in hive having the value multiplied by every time I do incremental
> import (If I do incremental import to bring new data into HDFS , the data
> in final table of hive after processing  i.e "Hive_Table_Final"  showing
> the values of all columns multiplied by the times of I done incremental
> pull), if I do perform incremental import for 4 days ( every day once
> incremental import in a day and did it for  4 days) i got  data multiplied
> 4 in the final table of hive (Hive_Table_Final)  with respect to final
> table in RDBMS (Rdbms_final_table).
>
>
> Like..
>
> 1) 1st time I have pulled the data from RDBMS based on the months (like
> from 2013-12-01 to 2013-01-01) and processed it, got perfect results
> matching the data in final Hive's  table(Hive_Table_Final) and RDBMS
> processed data into (Rdbms_Table_Final)
>
> 2) I have done incremental import to bring new data from RDBMS to HDFS by
> using this command..
>
>
>  sqoop import -libjars
>  --driver com.sybase.jdbc3.jdbc.SybDriver \
>  --query "select * from
>  from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
> --check-column Unique_value \
>  --incremental append \
>  --last-value 201401200 \
>  --split-by DEPT \
>  --fields-terminated-by ',' \
>  --target-dir ${TARGET_DIR}/${INC} \
>  --username ${SYBASE_USERNAME} \
>  --password ${SYBASE_PASSWORD} \
>
> "Note -- The field Unique_value is very unique for every time, its
> like primary key "
>
>
>
> As now I have just pulled the new records to my HDFS which were into  RDBMS
> tables..
>
> Now I got major data mis-match issue,  after the
> processing..(Hive_Table_final)
>
> My Major issue is with sqoop incremental import, as many times I do
> Incremental import I find the  data into my final table gets multiplied by
> the times I have done incremental import..
>
>
> Please suggest, whats wrong I am doing, Whats I am missing..
> pls help me out..
>
>
> Thanks & Regards
> Yogesh Kumar