sqoop-dev mailing list archives

From "Attila Szabo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Fri, 15 Dec 2017 14:18:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292590#comment-16292590 ]

Attila Szabo commented on SQOOP-3267:

Hey [~dvoros], [~vasas],

IMHO I would keep the history by default (b/c of the existing cmd line arguments, and b/c
as an end user I would otherwise get my data deleted without explicitly requesting that).

Aggregating your findings and my thoughts, my recommendations are the following:
By default (no other options present) I would insert a null value and keep the history.
If the mode aims for the last modified entry only, I would delete the history and only keep
the last meaningful value (and of course, in case of a null value, delete the column as you've
suggested). I would definitely go in this direction, b/c we're speaking about incremental
mode, and according to the existing documentation 'mode' is related to incremental mode (and
we did not make any distinction between incremental mode with append-only tables and incremental
mode for HBase, where we can do "real" modifications).
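
The two behaviours recommended above can be illustrated with a minimal sketch. It assumes a toy in-memory model of one versioned HBase column; the function names and the `keep_history` flag are hypothetical and are not Sqoop or HBase APIs.

```python
from typing import List, Optional, Tuple

# Toy model (assumption): one column is a list of (timestamp, value)
# versions, oldest first. A value of None stands for a NULL in the source.
Versions = List[Tuple[int, Optional[str]]]


def apply_update(cell: Versions, ts: int, value: Optional[str],
                 keep_history: bool) -> Versions:
    """Return the column's versions after importing `value` at time `ts`."""
    if keep_history:
        # Recommended default: store the new value (even NULL), keep old versions.
        return cell + [(ts, value)]
    if value is None:
        # Last-modified-only: a NULL in the source deletes the whole column.
        return []
    # Last-modified-only: drop the history, keep the last meaningful value.
    return [(ts, value)]


def delete_latest_only(cell: Versions) -> Versions:
    """Behaviour reported in this issue: only the newest version is removed."""
    return cell[:-1]


def visible_value(cell: Versions) -> Optional[str]:
    """What a reader of the newest version would see."""
    return cell[-1][1] if cell else None


cell = [(1, "a"), (2, "b")]
default_result = apply_update(cell, 3, None, keep_history=True)
last_only_result = apply_update(cell, 3, None, keep_history=False)
reported_result = delete_latest_only(cell)
```

In this model, deleting only the newest version leaves `(1, "a")` behind, so a stale value becomes visible again, which is the misleading result the issue describes.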

Though if you dislike leveraging the mode cmd line argument, I'm not against
introducing new cmd line arguments on this front to make it straightforward when we do deletes,
when we insert null values, when we keep history and when we do not. Although in this case
I would also highly recommend introducing some fail-fast scenario (from version 1.5) which
would give a meaningful error message in case of mode+HBase table+incremental import.

My 2cents,

ps.: [~vasas] your test cases are very well defined, and very detailed! Nice job!!!

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>         Attachments: SQOOP-3267.1.patch
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a
> column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple
> times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per
> row as before. This could probably lead to a performance drop for wide tables (for which HBase
> is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected
> behavior here?

This message was sent by Atlassian JIRA
