sqoop-dev mailing list archives

From "Szabolcs Vasas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Tue, 23 Jan 2018 15:42:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335941#comment-16335941 ]

Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

Hi [~dvoros],

Option B seems like a good direction to me. I agree that ideally the target HBase table should
reflect that a column is set to null in the source RDBMS, and I would not make this dependent
on the incremental mode, since in theory only "lastmodified" mode should change already existing
rows in the target table.
However, after thinking about this more thoroughly, a performance-related question has arisen.
Let's say the users want to import new rows (so it would be a regular import, not an incremental
one) from a wide table where most of the columns are null and only a couple of values are defined.
In this case the current implementation would use only a few Put commands, but the suggested
implementation would need significantly more Put commands just to add the null strings to
the HBase table. I think this is something the users would not prefer in this case. On the
other hand, you are right that it would be great if we could keep the representation of nulls
in the HBase table consistent, so that it would not differ between regular import
and incremental import...
Considering the above, I suggest the following solution:
 * Introduce an --hbase-null-incremental-mode (or similar name) option which would enable the
users to specify what Sqoop should do with the null values in the source RDBMS table. The
options could be:
 ** ignore (default) - This would be basically the behavior before SQOOP-3149
 ** delete - This would be similar to the behavior introduced in SQOOP-3149 but we would delete
the whole history
 ** null-string - Instead of null, Sqoop would put the null string specified in the new
--hbase-null-string option
 * Introduce a new option called --hbase-null-string which could be used to specify which
null string Sqoop should put into the HBase table instead of null. This could be used for
the regular imports too, but if it is not specified Sqoop should not use null strings, to avoid
the above-mentioned potential performance problem.
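
To make the three proposed modes concrete, here is a minimal sketch of how they could map to per-column operations for a single source row. It is only an illustration of the proposal above, not Sqoop code: the class name, the NullMode enum and planMutations are hypothetical, and plain strings stand in for HBase Put and Delete mutations so that the example runs without any HBase dependency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullModeSketch {

    /** Hypothetical values of the proposed --hbase-null-incremental-mode option. */
    enum NullMode { IGNORE, DELETE, NULL_STRING }

    /**
     * Plans the operations for one source row. A null map value models a NULL
     * cell in the source RDBMS. The returned strings are stand-ins for HBase
     * mutations, not real client calls.
     */
    static List<String> planMutations(Map<String, String> row, NullMode mode, String hbaseNullString) {
        if (mode == NullMode.NULL_STRING && hbaseNullString == null) {
            throw new IllegalArgumentException("null-string mode needs an --hbase-null-string value");
        }
        List<String> ops = new ArrayList<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            String column = cell.getKey();
            String value = cell.getValue();
            if (value != null) {
                ops.add("PUT " + column + "=" + value);
                continue;
            }
            switch (mode) {
                case IGNORE:
                    // Write nothing for the null column (behavior before SQOOP-3149).
                    break;
                case DELETE:
                    // Remove the whole history of the column, not only the last version.
                    ops.add("DELETE_ALL_VERSIONS " + column);
                    break;
                case NULL_STRING:
                    // Store the configured placeholder instead of null.
                    ops.add("PUT " + column + "=" + hbaseNullString);
                    break;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        // A small "wide" row: one defined value, two nulls.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "alice");
        row.put("email", null);
        row.put("phone", null);
        for (NullMode mode : NullMode.values()) {
            System.out.println(mode + " -> " + planMutations(row, mode, "\\N"));
        }
    }
}
```

With ignore, only the defined columns produce an operation, which is why that mode stays cheap for wide, mostly-null rows; the other two modes produce one operation per column. In the HBase client API the "whole history" case would presumably correspond to Delete.addColumns (all versions) rather than Delete.addColumn (latest version only).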

The benefit of this solution would be that the users would have more possibilities to control
how the null values are handled and it would not change the behavior unexpectedly (I might
be paranoid but I feel introducing the new --hbase-null-string is safer than overloading the
already existing --null-string).

Implementing all of this might be overkill for addressing this bug, so we could move the null-string
handling part to another Jira as well.

 

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a
> column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple
> times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per
> row as before. This could probably lead to a performance drop for wide tables (for which HBase
> is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected
> behavior here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
