sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szabolcs Vasas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Fri, 15 Dec 2017 13:06:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292487#comment-16292487
] 

Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

Hi [~dvoros],

Thank you for reporting this bug, this is a really nice catch!
I have created a couple of test cases to clarify the issue.

1) Updating a column without history in HBase

Let's create the test tables and import the initial values to HBase:

{code:java}
export MYCONN=...
export MYUSER=...

sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table hbase_test(id int primary
key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test (id, name,
date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test (id, name,
date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"

echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" --hbase-table hbase_test
--column-family data -m 1

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    column=data:date_modified, timestamp=1513340516549,
value=2017-12-15 12:17:26.0
 1                                                    column=data:name, timestamp=1513340516549,
value=name1
 2                                                    column=data:date_modified, timestamp=1513340516549,
value=2017-12-15 12:17:29.0
 2                                                    column=data:name, timestamp=1513340516549,
value=name2
2 row(s)

{code}

The output of the HBase scan is the expected, let's update a column to null in the database
and execute an incremental import:

{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set name=null,
date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" --hbase-table hbase_test
--column-family data -m 1 --incremental lastmodified --check-column date_modified --last-value
"2017-12-15 12:17:26.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    column=data:date_modified, timestamp=1513340614595,
value=2017-12-15 12:23:00.0
 2                                                    column=data:date_modified, timestamp=1513340614595,
value=2017-12-15 12:17:29.0
 2                                                    column=data:name, timestamp=1513340614595,
value=name2
2 row(s)

{code}

The output of the HBase scan is pretty much the behavior expected after SQOOP-3149, the name
column is updated to null, so it "disappeared" from the output of the scan, if we check that
column in HBase using the Java API we will get null value. So far so good.

2) Updating a column with history in HBase

This test case is really similar to the previous one but first we update the name column to
a non-null value:


{code:java}
export MYCONN=jdbc:mysql://`hostname`/test
export MYUSER=sqoop

sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table hbase_test(id int primary
key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test (id, name,
date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test (id, name,
date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"

echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" --hbase-table hbase_test
--column-family data -m 1

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    column=data:date_modified, timestamp=1513335407294,
value=2017-12-15 10:49:05.0
 1                                                    column=data:name, timestamp=1513335407294,
value=name1
 2                                                    column=data:date_modified, timestamp=1513335407294,
value=2017-12-15 10:49:09.0
 2                                                    column=data:name, timestamp=1513335407294,
value=name2
2 row(s)


sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set name='name_modified',
date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" --hbase-table hbase_test
--column-family data -m 1 --incremental lastmodified --check-column date_modified --last-value
"2017-12-15 10:49:09.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell


scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    column=data:date_modified, timestamp=1513335602892,
value=2017-12-15 10:58:44.0
 1                                                    column=data:name, timestamp=1513335602892,
value=name_modified
 2                                                    column=data:date_modified, timestamp=1513335602892,
value=2017-12-15 10:49:09.0
 2                                                    column=data:name, timestamp=1513335602892,
value=name2
2 row(s)

{code}

Everything is expected so far, the output of the second HBase scan command shows us the updated
name value. Now let's set that column to null and see what happens in HBase:

{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set name=null,
date_modified=CURRENT_TIMESTAMP where id = 1"

sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test" --hbase-table hbase_test
--column-family data -m 1 --incremental lastmodified --check-column date_modified --last-value
"2017-12-15 10:58:44.0" --merge-key id

echo "scan 'hbase_test'" | hbase shell

scan 'hbase_test'
ROW                                                   COLUMN+CELL
 1                                                    column=data:date_modified, timestamp=1513335804486,
value=2017-12-15 11:01:35.0
 1                                                    column=data:name, timestamp=1513335407294,
value=name1
 2                                                    column=data:date_modified, timestamp=1513335602892,
value=2017-12-15 10:49:09.0
 2                                                    column=data:name, timestamp=1513335602892,
value=name2
2 row(s)
Took 0.4017 seconds
{code}

The output of the last scan shows the bug reported in this JIRA. We can see that the data:date_modified
column is updated in row 1, but the value data:name column is not null but it is "reset" to
its previous value.

I think we all agree that the current behavior is not what we want the question is how we
should address this bug.
I am not a big fan of deleting the full history, since some users might rely on it, if it
is possible we should put a null value to the HBase table as well.

Regards,
Szabolcs

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a
column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple
times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per
row as before. This could probably lead to a performance drop for wide tables (for which HBase
is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected
behavior here?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message