sqoop-user mailing list archives

From Abraham Elmahrek <...@cloudera.com>
Subject Re: SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)
Date Thu, 16 Jan 2014 19:13:36 GMT
Is the data changing while you are running the Sqoop import? The
--verbose option should put some interesting debug information in your
logs, such as the query used at execution and the different WHERE
parameters used for splitting.
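
For example, a minimal sketch of adding it (the connect string, credentials,
and target directory here are made-up placeholders; keep your real arguments):

# --verbose makes Sqoop log the generated query and the split boundaries
sqoop import --verbose \
  --connect jdbc:sybase:Tds:dbhost:5000/mydb \
  --driver com.sybase.jdbc3.jdbc.SybDriver \
  --username user --password pass \
  --table EMP \
  --target-dir /tmp/emp_debug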

-Abe


On Mon, Jan 13, 2014 at 12:02 PM, yogesh kumar <yogeshsqoop@gmail.com> wrote:

> Here is the code I have used:
>
>
> sqoop import -libjars <path-to-sybase-jdbc-jar> \
>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>   --check-column Unique_value \
>   --incremental append \
>   --last-value 201401200 \
>   --split-by DEPT \
>   --fields-terminated-by ',' \
>   --target-dir ${TARGET_DIR}/${INC} \
>   --username ${SYBASE_USERNAME} \
>   --password ${SYBASE_PASSWORD}
>
>
> Now I have imported the newly inserted RDBMS data to HDFS,
>
> but when I run
>
> select count(*), unique_value from EMP group by unique_value
>
> (both in the RDBMS and in Hive), I find huge data loss.
>
> 1) in RDBMS
>
>   Count(*)    Unique_value
>    1000       201401201
>    5000       201401202
>   10000       201401203
>
>
> 2) in HIVE
>
>   Count(*)    Unique_value
>    189        201401201
>    421        201401202
>     50        201401203
>
>
> If I do
>
> select Unique_value from EMP;
>
> Result:
> 201401201
> 201401201
> 201401201
> 201401201
> 201401201
> .
> .
> 201401202
> .
> .
> and so on...
>
>
> On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <yogeshsqoop@gmail.com> wrote:
>
>> Hello Abe,
>>
>> Thanks a zillion for your response,
>>
>> Yes, the unique_value is SAL here.
>>
>> I have the option to test with a different directory, as it has a small
>> amount of data (only one month of data).
>>
>> Yes, the interesting fact is that when I did a Sqoop pull on a monthly
>> basis, without incremental import, the data matched very well. But for the
>> daily pull I have to do an incremental import every day,
>>
>> and then I find some data loss...
>>
>> One of the Sqoop guys suggested changing where "\$CONDITIONS" sits within
>> this query, so that
>>
>> ... from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204
>>
>> becomes
>>
>> ... from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS
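>>
>> So the full command would presumably look like this (a sketch; the
>> -libjars jar path is a placeholder for the Sybase JDBC jar):
>>
>> # note: <path-to-sybase-jdbc-jar> below is a placeholder, not a real path
>> sqoop import -libjars <path-to-sybase-jdbc-jar> \
>>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>   --query "select * from EMP where SAL > 201401200 and SAL <= 201401204 and \$CONDITIONS" \
>>   --check-column Unique_value \
>>   --incremental append \
>>   --last-value 201401200 \
>>   --split-by DEPT \
>>   --fields-terminated-by ',' \
>>   --target-dir ${TARGET_DIR}/${INC} \
>>   --username ${SYBASE_USERNAME} \
>>   --password ${SYBASE_PASSWORD}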
>>
>>
>>
>> Please suggest and help me out, as I have to deliver the output to my client.
>>
>> On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <abe@cloudera.com> wrote:
>>
>>> Yogesh,
>>>
>>> Is unique_value in this case SAL? I'm a bit confused about your query.
>>>
>>> Do you have the option of running this query on a separate database
>>> somewhere to find the issue? I think it would be interesting to see the
>>> initial state and then the state after running an incremental import. That
>>> would tell us how many results are being imported after Sqoop has run, and
>>> we can validate each step. Also, please use the --verbose flag to get the
>>> most out of the logs.
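>>>
>>> As a rough sketch of that validation, you could compare the source count
>>> for the window against what actually landed in HDFS (paths follow your
>>> variables; adjust as needed):
>>>
>>> # rows Sqoop wrote for this increment (one line per record)
>>> hdfs dfs -cat ${TARGET_DIR}/${INC}/part-* | wc -l
>>>
>>> # versus the source-side count, run in your SQL client:
>>> #   select count(*) from EMP where SAL > 201401200 and SAL <= 201401204;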
>>>
>>> -Abe
>>>
>>>
>>> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <srpunreddy@gmail.com> wrote:
>>>
>>>> Yogesh,
>>>>
>>>> Please try putting $CONDITIONS at the end of your WHERE clause.
>>>>
>>>> Check out the examples in the blog below.
>>>>
>>>>
>>>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
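>>>>
>>>> For instance, a minimal free-form-query import along those lines (the
>>>> connect string here is a made-up example; keep your real one):
>>>>
>>>> sqoop import \
>>>>   --connect jdbc:sybase:Tds:dbhost:5000/mydb \
>>>>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>   --username ${SYBASE_USERNAME} --password ${SYBASE_PASSWORD} \
>>>>   --query "select * from EMP where SAL > 201401200 and \$CONDITIONS" \
>>>>   --split-by DEPT \
>>>>   --target-dir ${TARGET_DIR}/${INC}
>>>>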
>>>>  On Jan 13, 2014 7:04 AM, "yogesh kumar" <yogeshsqoop@gmail.com> wrote:
>>>>
>>>>> Hello Jarcec,
>>>>>
>>>>> I found the issue; I hope this is the cause. I got data loss by doing
>>>>> the incremental pull.
>>>>>
>>>>> I have cross-checked it and found that:
>>>>>
>>>>> sqoop import -libjars <path-to-sybase-jdbc-jar> \
>>>>>   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>>   --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
>>>>>   --check-column Unique_value \
>>>>>   --incremental append \
>>>>>   --last-value 201401200 \
>>>>>   --split-by DEPT \
>>>>>   --fields-terminated-by ',' \
>>>>>   --target-dir ${TARGET_DIR}/${INC} \
>>>>>   --username ${SYBASE_USERNAME} \
>>>>>   --password ${SYBASE_PASSWORD}
>>>>>
>>>>>
>>>>> Now I have imported the newly inserted RDBMS data to HDFS,
>>>>>
>>>>> but when I run
>>>>>
>>>>> select count(*), unique_value from EMP group by unique_value
>>>>>
>>>>> (both in the RDBMS and in Hive), I find huge data loss.
>>>>>
>>>>> 1) in RDBMS
>>>>>
>>>>>   Count(*)    Unique_value
>>>>>    1000       201401201
>>>>>    5000       201401202
>>>>>   10000       201401203
>>>>>
>>>>>
>>>>> 2) in HIVE
>>>>>
>>>>>   Count(*)    Unique_value
>>>>>    189        201401201
>>>>>    421        201401202
>>>>>     50        201401203
>>>>>
>>>>>
>>>>> If I do
>>>>>
>>>>> select Unique_value from EMP;
>>>>>
>>>>> Result:
>>>>> 201401201
>>>>> 201401201
>>>>> 201401201
>>>>> 201401201
>>>>> 201401201
>>>>> .
>>>>> .
>>>>> 201401202
>>>>> .
>>>>> .
>>>>> and so on...
>>>>>
>>>>>
>>>>> Please help and suggest why this is so.
>>>>>
>>>>>
>>>>> Many thanks in advance
>>>>>
>>>>> Yogesh kumar
>>>>>
>>>>> On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
>>>>>
>>>>>> Hi Yogesh,
>>>>>> I would start by verifying the imported data. If there are duplicates,
>>>>>> that suggests some misconfiguration of Sqoop; otherwise you might
>>>>>> have some inconsistencies down the pipeline.
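>>>>>>
>>>>>> As a sketch, that duplicate check could be run in Hive against your
>>>>>> external table (the table name emp_external and key column EMPNO are
>>>>>> hypothetical; substitute your real ones):
>>>>>>
>>>>>> # list any key values that appear more than once after the import
>>>>>> hive -e "select EMPNO, count(*) from emp_external
>>>>>>          group by EMPNO having count(*) > 1;"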
>>>>>>
>>>>>> Jarcec
>>>>>>
>>>>>> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
>>>>>> > Hello All,
>>>>>> >
>>>>>> > I am working on a use case where I have to run a process on a daily
>>>>>> > basis which will do these things:
>>>>>> >
>>>>>> > 1) Pull each day's newly inserted RDBMS data to HDFS
>>>>>> > 2) Keep an external table in Hive pointing to the HDFS directory
>>>>>> > where the data is pulled by Sqoop (see the sketch below)
>>>>>> > 3) Perform some Hive queries (joins) and create a final internal
>>>>>> > table in Hive (say, Hive_Table_Final).
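>>>>>> >
>>>>>> > For step 2, a minimal sketch of the external table (the column names,
>>>>>> > types, and the table name emp_external are hypothetical):
>>>>>> >
>>>>>> > hive -e "create external table emp_external (
>>>>>> >            EMPNO int, DEPT string, SAL bigint, Unique_value bigint)
>>>>>> >          row format delimited fields terminated by ','
>>>>>> >          location '${TARGET_DIR}';"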
>>>>>> >
>>>>>> >
>>>>>> > What I am doing:
>>>>>> >
>>>>>> > I am migrating a process from an RDBMS to Hadoop (the same process is
>>>>>> > being executed in an RDBMS procedure and stored in a final table, say
>>>>>> > Rdbms_Table_Final).
>>>>>> >
>>>>>> > The issue I am facing:
>>>>>> >
>>>>>> > Every time I do an incremental import, after processing I find the
>>>>>> > values in the final Hive table multiplied by the number of times I
>>>>>> > have done an incremental import. If I do an incremental import to
>>>>>> > bring new data into HDFS, the data in the final Hive table after
>>>>>> > processing, i.e. "Hive_Table_Final", shows the values of all columns
>>>>>> > multiplied by the number of incremental pulls I have done. When I
>>>>>> > performed an incremental import once a day for 4 days, the data in
>>>>>> > the final Hive table (Hive_Table_Final) was multiplied by 4 with
>>>>>> > respect to the final table in the RDBMS (Rdbms_Table_Final).
>>>>>> >
>>>>>> >
>>>>>> > Like:
>>>>>> >
>>>>>> > 1) The first time, I pulled the data from the RDBMS based on months
>>>>>> > (like from 2013-12-01 to 2014-01-01) and processed it, and got perfect
>>>>>> > results: the final Hive table (Hive_Table_Final) matched the processed
>>>>>> > RDBMS data in (Rdbms_Table_Final).
>>>>>> >
>>>>>> > 2) I did an incremental import to bring the new data from the RDBMS
>>>>>> > to HDFS using this command:
>>>>>> >
>>>>>> > sqoop import -libjars <path-to-sybase-jdbc-jar> \
>>>>>> >   --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>>> >   --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
>>>>>> >   --check-column Unique_value \
>>>>>> >   --incremental append \
>>>>>> >   --last-value 201401200 \
>>>>>> >   --split-by DEPT \
>>>>>> >   --fields-terminated-by ',' \
>>>>>> >   --target-dir ${TARGET_DIR}/${INC} \
>>>>>> >   --username ${SYBASE_USERNAME} \
>>>>>> >   --password ${SYBASE_PASSWORD}
>>>>>> >
>>>>>> > "Note -- The field Unique_value is very unique for every time,
its
>>>>>> > like primary key "
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Now I have pulled just the new records, which were in the RDBMS
>>>>>> > tables, to my HDFS.
>>>>>> >
>>>>>> > Now I have a major data-mismatch issue after the
>>>>>> > processing (Hive_Table_Final).
>>>>>> >
>>>>>> > My major issue is with Sqoop incremental import: however many times
>>>>>> > I do an incremental import, I find the data in my final table
>>>>>> > multiplied by that number of times.
>>>>>> >
>>>>>> >
>>>>>> > Please suggest what I am doing wrong and what I am missing.
>>>>>> > Please help me out.
>>>>>> >
>>>>>> >
>>>>>> > Thanks & Regards
>>>>>> > Yogesh Kumar
>>>>>>
>>>>>
>>>>>
>>>
>>
>
