sqoop-dev mailing list archives

From "Szabolcs Vasas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Wed, 24 Jan 2018 10:06:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337301#comment-16337301 ]

Szabolcs Vasas commented on SQOOP-3267:

*re: "or every column, but I've already addressed this issue in [^SQOOP-3267.1.patch] (see
first comment on this issue)."*

Sorry, I missed this; it is a nice improvement!

Even if we ignore the slight performance overhead, the problem with the default null string
could be that the output HBase table of a regular import would be different (we would get
defined columns with empty strings instead of undefined columns), and this behavior change
is a bit unexpected from a bug JIRA. It would solve this particular bug but could lead to
confusion in the future.

I am not sure I understand how you would split up the work between the two JIRAs, and I wasn't
very clear in my previous comment, so let me summarize what I suggest:
 * This JIRA would add the --hbase-null-incremental-mode option with two possible values:
ignore (default) and delete. This would essentially restore the behavior we had prior to SQOOP-3149
while keeping the functionality it intentionally introduced. It would be a fairly
localized change, and it would not affect users who do not do incremental imports at all.
 * Another JIRA would introduce a new possible value (null-string) for --hbase-null-incremental-mode
and a new option, --hbase-null-string, to specify the string to write. I think this change should be
classified as a new feature. --hbase-null-string could be usable with regular imports too, but if the
user does not specify it, we should stick to the current behavior and not insert any null string
into the columns which have nulls in the RDBMS.
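
To make the proposed split concrete, here is a minimal sketch of how the three modes could differ
when a source column is NULL. It is only an illustration of the suggestion above, not code from
Sqoop or from the attached patch: the class NullModeSketch, the enum NullMode and the method plan
are hypothetical names, and the operations are plain strings rather than real HBase Put/Delete
objects. How many versions a delete removes is the subject of this issue and is not modeled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Sqoop.
class NullModeSketch {

    // Mirrors the values proposed for --hbase-null-incremental-mode.
    enum NullMode { IGNORE, DELETE, NULL_STRING }

    // Builds a readable plan of the operations for one imported row.
    // nullString is only consulted in NULL_STRING mode (the proposed --hbase-null-string).
    static List<String> plan(Map<String, String> row, NullMode mode, String nullString) {
        List<String> ops = new ArrayList<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            String column = cell.getKey();
            String value = cell.getValue();
            if (value != null) {
                // Non-null cells are written in every mode.
                ops.add("PUT " + column + "=" + value);
                continue;
            }
            switch (mode) {
                case IGNORE:
                    // Default: leave whatever is already stored in HBase untouched.
                    break;
                case DELETE:
                    // Remove the column because the source cell became NULL.
                    ops.add("DELETE " + column);
                    break;
                case NULL_STRING:
                    // Write the user-supplied placeholder instead of deleting.
                    ops.add("PUT " + column + "=" + nullString);
                    break;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "alice");
        row.put("email", null); // NULL in the RDBMS
        for (NullMode mode : NullMode.values()) {
            System.out.println(mode + " -> " + plan(row, mode, ""));
        }
    }
}
```

With ignore as the default, a user who never sets the option gets no deletes and no placeholder
values, which is the "no behavior change" property argued for above.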


> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per row as before. This could probably lead to a performance drop for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected behavior here?

