hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Remove the row in MR job?
Date Fri, 12 Oct 2012 19:47:30 GMT
Hi Doug,

Thanks for the suggestion. I like the idea of simply deleting the
table; however, I'm not sure I can implement it.

Basically, I have one process which is constantly feeding the table,
and, once a day, I want to run an MR job to process this table (which
will empty it).

While I'm processing it, I still want the other process to be able
to store data.

Since I can't rename the table because this functionality doesn't
exist, I need to have the two working on the same table.

Maybe what I can do is work with the column name... For example, I
store into a different column every day, based on the day number, and
run the MR job on all the columns except today's. After that, I can
delete all the columns except the one for the current day. The issue
is if the MR job takes more than 24h...
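
To make the idea concrete, here is a minimal sketch of the naming scheme in
plain Java. It does not touch HBase at all; the class name, the "d" prefix
and the lookbackDays parameter are just assumptions for illustration. The
list of columns to process is computed once, from the date at job start, so
a job that runs past midnight keeps working on the same columns.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "one column per day" naming idea; all names here are hypothetical. */
final class DayColumns {
    /** Column qualifier for a date, built from the day of the year. */
    static String qualifierFor(LocalDate date) {
        return "d" + date.getDayOfYear();
    }

    /** Qualifiers for the lookbackDays days before today, oldest first; today is excluded. */
    static List<String> qualifiersToProcess(LocalDate today, int lookbackDays) {
        List<String> result = new ArrayList<>();
        for (int i = lookbackDays; i >= 1; i--) {
            result.add(qualifierFor(today.minusDays(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2012, 10, 12);
        // The feeding process writes to today's column only.
        System.out.println("write to: " + qualifierFor(today));
        // The MR job would scan (and later delete) only the older columns.
        System.out.println("process:  " + qualifiersToProcess(today, 3));
    }
}
```

A day-of-year qualifier repeats after a year, so this only works if old
columns really are deleted after each run.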

Also, is deleting a column fast?


2012/10/12 Doug Meil <doug.meil@explorysmedical.com>:
> I'm not entirely sure of the use-case, but here are some thoughts on this...
> re:  "should I take the table from the pool, and simply call the delete
> method?"
> Yep, you can construct an HTable instance within a MR job.  But use the
> delete that takes a list because the single-delete will invoke an RPC for
> each one (not great over an MR job).
> Construct the HTable instance at the Mapper level (not map-method level)
> and keep a buffer of deletes in a List.  At the end of the job, send any
> un-processed deletes in the cleanup method.
> I'm not entirely sure why you'd want to delete every row in a table (as
> opposed to processing all the rows in Table1 and generating an entirely
> new Table2).  And then drop Table1 when you're done with it.  That seems
> like it would be less hassle than deleting every row (since the table is
> empty anyway).
> On 10/12/12 1:20 PM, "Jean-Marc Spaggiari" <jean-marc@spaggiari.org> wrote:
>>I have a table which I want to parse over a MR job.
>>Today, I'm using a scan to parse all the rows. Each row is retrieved,
>>removed, and then parsed (feeding 2 other tables).
>>The goal is to parse all the content while some process might still be
>>adding some more.
>>In the map method of the MR job, can I delete the row I'm working
>>with? If so, how should I do it? should I take the table from the pool,
>>and simply call the delete method? The issue is that doing a delete for
>>each line will take a while. I would prefer to batch them, but I don't
>>know which line will be the last, so it's difficult to know when to
>>send the batch.  Is there a way to tell the MR job to delete this
>>line? Also, what's the impact on the MR job if I delete the row it's
>>working on?
>>Or is an MR job not the best way to do that?
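
The buffering Doug describes in the quoted reply (keep the deletes in a
List at the Mapper level, send whatever is left in cleanup) can be sketched
without any HBase classes. In this sketch the Consumer is only a stand-in
for a batch call such as HTable.delete(List<Delete>), and the DeleteBuffer
name, the String row keys and the batchSize parameter are assumptions made
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the buffering pattern: collect row keys, flush them in batches.
 * The Consumer stands in for a batch delete call; no HBase classes are used.
 */
final class DeleteBuffer {
    private final int batchSize;
    private final Consumer<List<String>> batchDelete;
    private final List<String> pending = new ArrayList<>();

    DeleteBuffer(int batchSize, Consumer<List<String>> batchDelete) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.batchDelete = batchDelete;
    }

    /** Call from map(): queue one row key, flushing when the buffer is full. */
    void add(String rowKey) {
        pending.add(rowKey);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Call from cleanup(): send whatever is still buffered. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Hand over a copy so clearing the buffer does not affect the batch.
        batchDelete.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        DeleteBuffer buffer = new DeleteBuffer(2,
                batch -> System.out.println("deleting " + batch));
        for (String key : new String[] {"row1", "row2", "row3"}) {
            buffer.add(key);   // what map() would do for each row
        }
        buffer.flush();        // what cleanup() would do at the end
    }
}
```

In a real Mapper the buffer would be created in setup(), filled in map()
and flushed in cleanup(), so the last partial batch is sent without having
to know which row is the last one.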
