sqoop-dev mailing list archives

From "Yulei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SQOOP-3263) Duplicate rows found when split-by column is of textual type due to different charset difference of sqoop and hadoop
Date Sun, 26 Nov 2017 15:56:00 GMT

     [ https://issues.apache.org/jira/browse/SQOOP-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yulei Yang updated SQOOP-3263:
------------------------------
    Attachment: sqoop-3263.patch

> Duplicate rows found when split-by column is of textual type due to different charset
difference of sqoop and hadoop
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3263
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3263
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Yulei Yang
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png,
sqoop-3263.patch
>
>
> This issue can occur with any RDBMS, because the root cause is not in the RDBMS.
> Steps to reproduce this issue:
> 1. Create a MySQL table: create table ora_test (id varchar(32) primary key not null);
> 2. Insert *5* rows:
> insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
> insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
> insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
> insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
> insert into ora_test values ('4 ABC');
> 3. Import it into Hive with sqoop import -m 32 (m=189 also works). You will then get
*7* rows in Hive. See screenshot-1.png.
> part-32 duplicates rows that are also in part-26.
> So I printed their split boundary values both as Unicode escapes and as plain text; see
screenshot-2.png for part-26 and screenshot-3.png for part-32.
> According to the boundary values, part-26 has no problem while part-32 is wrong:
because '\u4\ud836' is larger than '4F', part-32 should contain no records.
> So the '?' in the plain text of part-32 is suspicious: is its Unicode value still '\ud836'
when the query runs on the RDBMS?
> So I ran another test; see screenshot-4.png. Two different Unicode characters are mapped
to the same character in UTF-8.
> This causes the duplication.
> How does this happen?
> 1. The split boundary values are Unicode strings.
> 2. When the import MR job starts, it reads the split boundary values into the Text type.
Text always uses UTF-8, so some characters come out wrong, as in the case above.
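
The effect described above can be reproduced with the JDK alone. The following is a minimal sketch, not Sqoop or Hadoop code: the class and method names are made up for illustration, and a plain String-to-UTF-8 round trip stands in for storing a boundary value in a UTF-8-only container such as Text. It assumes the boundary contains an unpaired surrogate like the one quoted in the report; the second surrogate (0xD837) is a hypothetical value chosen only to show the collision.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Sqoop or Hadoop.
final class Utf8BoundaryDemo {

    // Encode to UTF-8 and decode again. An unpaired surrogate cannot be
    // encoded, so String.getBytes substitutes the replacement byte '?'.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String boundaryA = "4" + (char) 0xD836; // '4' + unpaired high surrogate
        String boundaryB = "4" + (char) 0xD837; // hypothetical second boundary

        // Before the round trip the two values differ and both sort after "4F".
        System.out.println(boundaryA.equals(boundaryB));      // false
        System.out.println(boundaryA.compareTo("4F") > 0);    // true

        // After the round trip both collapse to "4?", which sorts before "4F".
        String a = roundTripUtf8(boundaryA);
        String b = roundTripUtf8(boundaryB);
        System.out.println(a.equals(b));                      // true
        System.out.println(a.compareTo("4F") < 0);            // true
    }
}
```

Once the lower boundary sorts before '4F' instead of after it, a split that should be empty starts matching rows that already belong to another split.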
> My solution is to convert the Sqoop-generated split boundary values to UTF-8 first, and
then re-sort them.
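
The proposed fix could look roughly like the sketch below. This is not the attached patch: the class and method names are hypothetical, the boundaries are modelled as a plain list of strings, and dropping boundaries that become identical after conversion is an extra assumption beyond the "convert, then re-sort" idea stated above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper; not the code from sqoop-3263.patch.
final class BoundaryNormalizer {

    // Round-trip every boundary through UTF-8 so it already has the form it
    // will take once stored as UTF-8, then return the values in sorted order.
    // TreeSet also removes boundaries that collapse to the same string
    // (an assumption of this sketch).
    static List<String> normalize(List<String> boundaries) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String boundary : boundaries) {
            byte[] utf8 = boundary.getBytes(StandardCharsets.UTF_8);
            sorted.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        List<String> raw = List.of("08", "4F", "4" + (char) 0xD836, "56");
        // The unpaired surrogate becomes '?', so its boundary moves before "4F".
        System.out.println(normalize(raw)); // [08, 4?, 4F, 56]
    }
}
```

Sorting after the conversion means the split ranges are built from the same values the tasks will later compare against, so no two ranges overlap.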



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
