hbase-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Rename tables or swap alias
Date Mon, 06 Jun 2016 22:08:40 GMT
We implemented this by upserting changed elements and dropping others. On a given cluster it
takes 4.5 hours to load HBase; the trim and cleanup as currently implemented takes 4 days.
Back to the drawing board.

I’ve read the references but still don’t grok what to do. I have a table with an event
stream, containing duplicates and expired data. I’d like to find the most time-efficient
way to remove duplicates and drop expired data from what I’ll call the main_table. This
is being queried and added to all the time.
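
To make the cleanup itself concrete, here is a minimal sketch in plain Java (no Spark or HBase calls). It assumes, purely for illustration, that a duplicate means "same event id" and that expiry is a timestamp cutoff; the thread does not define either, and the names EventCleaner, Event, and clean are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class EventCleaner {
    /** Hypothetical event shape; the thread does not say what identifies a duplicate. */
    public static final class Event {
        public final String id;
        public final long timestampMillis;

        public Event(String id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Drops events older than cutoffMillis and keeps only the newest event per id.
     * The result is a fresh list, so it can be bulk-written to a new table
     * instead of being applied as deletes and upserts against the live one.
     */
    public static List<Event> clean(List<Event> events, long cutoffMillis) {
        Map<String, Event> newestById = new LinkedHashMap<>();
        for (Event e : events) {
            if (e.timestampMillis < cutoffMillis) {
                continue; // expired (assumption: expiry is a simple timestamp cutoff)
            }
            Event seen = newestById.get(e.id);
            if (seen == null || e.timestampMillis > seen.timestampMillis) {
                newestById.put(e.id, e); // keep the newest copy of a duplicate
            }
        }
        return new ArrayList<>(newestById.values());
    }
}
```

The same filter-then-keep-newest logic could run as a Spark transform over an RDD; the point is only that the output is a complete replacement data set rather than a list of in-place changes.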

My first thought was to create a new clean_table with Spark by reading main_table, processing,
and writing clean_table, then renaming main_table to old_table and renaming clean_table to
main_table. I could then drop old_table. Ignoring what happens to events during renaming, this
would be efficient because it would be equivalent to loading, with no complex updates to tables
in place and under load.

Snapshots and clones seem to miss the issue, which is writing the cleaned data to some place
that can then act like main_table, but clearly I don’t understand snapshots and clones. They
seem to be some way to alias a table so only changes are logged, without actually copying
the data. I’m not sure I care about copying the data into an RDD, which will then undergo
some transforms into a final RDD. This can be written efficiently into clean_table with no
upserts or dropping of elements, which seem to be what causes things to slow to a halt.

So assuming I have clean_table, how do I get all queries to go to it, instead of main_table?
Elasticsearch has an alias that I can just point somewhere new. Do I need to keep track of
something like this outside of HBase and change it after creating clean_table or am I missing
how to do this with snapshots and clones?
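
One way to picture the "keep track of it outside of HBase" option is a small indirection that maps a stable logical name to whichever physical table is currently live. The sketch below is plain Java and only works within one process; the class TableAlias and its methods are hypothetical, and sharing the pointer across clients (for example in ZooKeeper or a small metadata table) is an assumption not covered here.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical client-side alias: a stable logical name pointing at a physical table name. */
public final class TableAlias {
    private final AtomicReference<String> physicalTable;

    public TableAlias(String initialPhysicalTable) {
        this.physicalTable = new AtomicReference<>(initialPhysicalTable);
    }

    /** Readers and writers call this before each operation to find the live table. */
    public String resolve() {
        return physicalTable.get();
    }

    /**
     * Atomically repoints the alias to a new table.
     * Returns the previous table name so the caller can drop it later.
     */
    public String swap(String newPhysicalTable) {
        return physicalTable.getAndSet(newPhysicalTable);
    }
}
```

Usage would look like: build clean_table, call swap("clean_table"), then drop the table name that swap returned once in-flight requests against it have finished. Writes that land in the old table between the Spark read and the swap are not carried over, which is the same gap Ted points out for the snapshot approach.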

From: Ted Yu <yuzhihong@gmail.com>
Subject: Re: Rename tables or swap alias
Date: February 16, 2016 at 6:48:53 AM PST
To: "user@hbase.apache.org" <user@hbase.apache.org>
Reply-To: user@hbase.apache.org

Please see http://hbase.apache.org/book.html#ops.snapshots for background
on snapshots.

In Anil's description, table_old is the result of cloning the snapshot
which is taken in step #1. See
http://hbase.apache.org/book.html#ops.snapshots.clone


On Tue, Feb 16, 2016 at 6:35 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> I think I can work out the algorithm if I knew precisely what a “snapshot”
> does. From my reading it seems to be a lightweight fast alias (for lack of
> a better word) since it creates something that refers to the same physical
> data. So if I create a new table with cleaned data, call it table_new. Then
> I drop table_old and “snapshot” table_new into table_old? Is this what is
> suggested?
> This leaves me with a small time where there is no table_old, which is the
> time between dropping table_old and creating a snapshot. Is it feasible to
> lock the DB for this time?
>> On Feb 15, 2016, at 7:13 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> Keep in mind that if the writes to this table are not paused, there would
>> be some data coming in between steps #1 and #2 which would not be in the
>> snapshot.
>> Cheers
>> On Mon, Feb 15, 2016 at 6:21 PM, Anil Gupta <anilgupta84@gmail.com> wrote:
>>> I don't think there are any atomic operations in HBase to support DDL across
>>> 2 tables.
>>> But maybe you can use HBase snapshots.
>>> 1. Create an HBase snapshot.
>>> 2. Truncate the table.
>>> 3. Write data to the table.
>>> 4. Create a table from the snapshot taken in step #1 as table_old.
>>> Now you have two tables: one with current run data and the other with last
>>> run data.
>>> I think the above process will suffice. But keep in mind that it is not
>>> atomic.
>>> HTH,
>>> Anil
>>> Sent from my iPhone
>>>> On Feb 15, 2016, at 4:25 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>> Is there any other way to do what I was asking? With Spark it is very normal
>>>> to treat a table as immutable and create another to replace the old.
>>>> Can you lock two tables and rename them in 2 actions, then unlock in a
>>>> very short period of time?
>>>> Or is there an alias for table names?
>>>> I didn’t see these in any docs or Googling; any help is appreciated.
>>>> Writing all this data back to the original table would be a huge load on a
>>>> table being written to by external processes and therefore under large load
>>>> to begin with.
>>>>> On Feb 14, 2016, at 5:03 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> There is currently no native support for renaming two tables in one atomic
>>>>> action.
>>>>> FYI
>>>>>> On Sun, Feb 14, 2016 at 4:18 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>>> I use Spark to take an old table and clean it up to create an RDD of cleaned
>>>>>> data. What I’d like to do is write all of the data to a new table in HBase,
>>>>>> then rename the table to the old name. If possible it could be done by
>>>>>> changing an alias to point to the new table, as long as all external code
>>>>>> uses the alias, or by a 2-table rename operation. But I don’t see how to do
>>>>>> this for HBase. I am dealing with a lot of data so I don’t want to do table
>>>>>> modifications with deletes and upserts; this would be incredibly slow.
>>>>>> Furthermore I don’t want to disable the table for more than a tiny span of
>>>>>> time.
>>>>>> Is it possible to have 2 tables and rename both in an atomic action, or
>>>>>> change some alias to point to the new table in an atomic action? If not,
>>>>>> what is the quickest way to achieve this to minimize time disabled?
