sqoop-dev mailing list archives

From "Daniel Voros (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Wed, 06 Dec 2017 11:11:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280029#comment-16280029 ]

Daniel Voros commented on SQOOP-3267:

[~maugli] thanks for your response. Is the intention behind append mode to keep the history?
I thought it's the mode to use when importing an append-only table, where you only create
new records and never change the existing ones. So I assumed that changes (and therefore
deletes) never happen when you use append mode with the correct last-value. Am I missing
something here?

I think it's usually a bad idea to delete only the last version of a column, since a simple
"get" in HBase might then return an inconsistent state (one that never existed on the source
side). If we are to keep history, we should probably put null (or empty string) values instead
of deleting.
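
To make the concern concrete, here is a minimal sketch using a toy in-memory model of one versioned row. It is not the HBase client API and not Sqoop's code: the class `VersionedRow`, the function `import_row`, the strategy names, and the sample snapshots are all hypothetical, chosen only to illustrate the three options discussed (delete the latest version, delete all versions, or put an empty value).

```python
from collections import defaultdict


class VersionedRow:
    """Toy stand-in for one HBase row: every column keeps (timestamp, value) versions."""

    def __init__(self):
        self.cells = defaultdict(list)

    def put(self, column, ts, value):
        self.cells[column].append((ts, value))

    def delete_latest(self, column):
        # Remove only the newest version; older versions become visible again.
        versions = self.cells[column]
        if versions:
            versions.remove(max(versions, key=lambda v: v[0]))

    def delete_all(self, column):
        # Remove every stored version of the column.
        self.cells.pop(column, None)

    def get(self):
        # Like a default "get": return the newest remaining version of each column.
        return {
            col: max(vs, key=lambda v: v[0])[1]
            for col, vs in self.cells.items()
            if vs
        }


def import_row(row, ts, source, strategy):
    """Apply one source snapshot; `strategy` decides how a NULL (None) is handled."""
    for column, value in source.items():
        if value is not None:
            row.put(column, ts, value)
        elif strategy == "delete_latest":
            row.delete_latest(column)
        elif strategy == "delete_all":
            row.delete_all(column)
        elif strategy == "put_empty":
            row.put(column, ts, "")


# Hypothetical history of one source row, imported after each change.
SNAPSHOTS = [
    {"a": "1", "b": "x"},
    {"a": "2", "b": "y"},
    {"a": "3", "b": None},  # b was set to NULL on the source side
]


def replay(strategy):
    row = VersionedRow()
    for ts, snapshot in enumerate(SNAPSHOTS, start=1):
        import_row(row, ts, snapshot, strategy)
    return row.get()
```

In this model, `replay("delete_latest")` yields `a = "3"` together with `b = "x"`, a combination the source row never held, while `replay("delete_all")` and `replay("put_empty")` both reflect the final source state.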

Please let me know what you think!

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>         Attachments: SQOOP-3267.1.patch
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a
> column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple
> times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per
> row as before. This could probably lead to a performance drop for wide tables (for which HBase
> is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected
> behavior here?
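
The difference between one Put per column and one Put per row can be sketched with a toy model that only counts mutations. This is not Sqoop's or HBase's code: the functions `puts_per_column` and `put_per_row` and the tuple representation of a mutation are hypothetical, and any batching the HBase client may do is not modeled.

```python
def puts_per_column(row_key, columns):
    """One mutation per non-NULL column (the per-column style described above)."""
    return [
        (row_key, {column: value})
        for column, value in columns.items()
        if value is not None
    ]


def put_per_row(row_key, columns):
    """A single mutation carrying all non-NULL columns of the row."""
    non_null = {c: v for c, v in columns.items() if v is not None}
    return [(row_key, non_null)] if non_null else []


# Hypothetical wide row with 100 columns.
wide_row = {f"c{i}": str(i) for i in range(100)}

per_column = puts_per_column("row-1", wide_row)
per_row = put_per_row("row-1", wide_row)
```

For this 100-column row the per-column style produces 100 mutations and the per-row style produces one, which is the overhead the description is worried about for wide tables.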

This message was sent by Atlassian JIRA
