hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Coprocessor end point vs MapReduce?
Date Thu, 25 Oct 2012 13:01:45 GMT
Hi all,

First, sorry for my slow reply to this thread; it went to my spam
folder and I lost sight of it.

I don’t have a good knowledge of RDBMSs, so I don’t have a good
knowledge of triggers either. That’s why I also looked at endpoints:
they are pretty new to me.
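
To make sure I understand the endpoint side, here is roughly how I
read the HBASE-6942 client invocation (the 0.94-style coprocessorExec
API). This is an untested sketch based on my reading of the patch; the
table name, time range, and batch size are made up, so please correct
me if the names are off:

  // Untested sketch: calling the HBASE-6942 bulk-delete endpoint with
  // the 0.94 CoprocessorProtocol API. Assumes the endpoint is loaded
  // on the table; "my_table", the time range, and the batch size of
  // 500 are made up for illustration.
  public static void bulkDeleteOlderThan(Configuration conf, long cutoff)
      throws Throwable {
    HTable table = new HTable(conf, "my_table");
    final Scan scan = new Scan();
    scan.setTimeRange(0, cutoff); // only cells older than the cutoff
    Map<byte[], BulkDeleteResponse> results = table.coprocessorExec(
        BulkDeleteProtocol.class, scan.getStartRow(), scan.getStopRow(),
        new Batch.Call<BulkDeleteProtocol, BulkDeleteResponse>() {
          public BulkDeleteResponse call(BulkDeleteProtocol instance)
              throws IOException {
            // Whole-row deletes, 500 rows per server-side batch.
            return instance.delete(scan, BulkDeleteProtocol.DeleteType.ROW,
                null, 500);
          }
        });
    long deleted = 0;
    for (BulkDeleteResponse r : results.values()) {
      deleted += r.getRowsDeleted(); // per-region counts
    }
    table.close();
  }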

First, I can’t really use multiple tables. I have one process writing
to this table in near real-time. Another one is deleting from this
table too. But some rows are never deleted: they time out and need to
be moved by the process I’m building here.

I was not aware that it’s possible to set the priority of an MR job
(any link showing how?). That’s something I will dig into. I was a bit
worried about the network load if I do the deletes row by row instead
of in bulk.
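
For the archives, the knobs I have found so far look like this (MR1
era; untested, and the pool name is made up and has to exist in the
scheduler config):

  // Sketch: submit the weekly job with a low priority, and/or into a
  // dedicated fair-scheduler pool so it cannot starve the other jobs.
  Configuration conf = HBaseConfiguration.create();
  // Classic MR1 job priority: VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH.
  conf.set("mapred.job.priority", "LOW");
  // Fair scheduler: target a low-weight pool ("weekly-maintenance" is
  // a made-up name).
  conf.set("mapred.fairscheduler.pool", "weekly-maintenance");
  Job job = new Job(conf, "weekly-row-mover");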

What I still don’t understand is: since the CP and the MR job are both
running on the region side, why is the MR better than the CP? Because
the Hadoop framework is taking care of it and will guarantee that it
runs on all the regions?

Also, are there some sort of “pre” and “post” methods I can override
for MR jobs, to build a list of puts/deletes and submit them at the
end? Or should I do that one by one in the map method?
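
Looking at the Mapper javadoc, setup() and cleanup() seem to be
exactly those hooks. Here is a rough, untested sketch of what I have
in mind (0.94-era API; the table names are made up, and a real job
should flush in bounded batches instead of keeping everything in
memory):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.io.NullWritable;

  public class MoveRowsMapper extends TableMapper<NullWritable, NullWritable> {
    private HTable source;
    private HTable archive;
    private final List<Put> puts = new ArrayList<Put>();
    private final List<Delete> deletes = new ArrayList<Delete>();

    @Override
    protected void setup(Context context) throws IOException {
      // "pre": open the tables once per map task.
      source = new HTable(context.getConfiguration(), "live_table");
      archive = new HTable(context.getConfiguration(), "archive_table");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value,
        Context context) throws IOException {
      Put put = new Put(row.get());
      // ... copy the cells of 'value' into 'put' here ...
      puts.add(put);
      deletes.add(new Delete(row.get()));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      // "post": submit everything in bulk at the end of the task.
      archive.put(puts);
      source.delete(deletes);
      archive.close();
      source.close();
    }
  }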

Thanks,

JM


2012/10/18, lohit <lohit.vijayarenu@gmail.com>:
> I might be a little off here, but if rows are moved to another table
> on a weekly or daily basis, why not create a per-week or per-day
> table? That way you don't need to copy and delete. Of course, it will
> not work if you are selectively filtering between timestamps, and
> clients would have to have a notion of multiple tables.
>
> 2012/10/18 Anoop Sam John <anoopsj@huawei.com>
>
>> A CP and endpoints operate at the region level; any operation within
>> one region can be performed with them. In the use case below, along
>> with the delete there was a need to insert data into another table,
>> and it was a kind of periodic action. I really doubt that endpoints
>> alone can be used here; I also tend towards the MR.
>>
>> The idea behind the bulk delete CP is simple. We had a use case of
>> deleting a bulk of rows, and it needed to be an online delete. I have
>> also seen many people on the mailing list asking questions about
>> this. In every case, people were scanning, bringing the rowkeys to
>> the client side, and then doing the deletes; most of the time the
>> complaint was the slowness. One bulk-delete performance improvement
>> was done in HBASE-6284. Still, we thought we could do the whole
>> operation (scan + delete) on the server side, making use of
>> endpoints. This is much faster and can be used for online bulk
>> deletes.
>>
>> -Anoop-
>>
>> ________________________________________
>> From: Michael Segel [michael_segel@hotmail.com]
>> Sent: Thursday, October 18, 2012 11:31 PM
>> To: user@hbase.apache.org
>> Subject: Re: Coprocessor end point vs MapReduce?
>>
>> Doug,
>>
>> One thing that concerns me is that a lot of folks are gravitating to
>> coprocessors and may be using them for the wrong thing.
>> Has anyone done any sort of research into the limitations and
>> negative impacts of using coprocessors?
>>
>> While I haven't really toyed with the idea of bulk deletes, periodic
>> deletes are probably not a good use of coprocessors... however, using
>> them to synchronize tables would be a valid use case.
>>
>> Thx
>>
>> -Mike
>>
>> On Oct 18, 2012, at 7:36 AM, Doug Meil <doug.meil@explorysmedical.com>
>> wrote:
>>
>> >
>> > To echo what Mike said about KISS: would you use triggers for a
>> > large time-sensitive batch job in an RDBMS? It's possible, but you
>> > probably wouldn't. So you might want to think twice about using
>> > coprocessors for such a purpose with HBase.
>> >
>> > On 10/17/12 9:50 PM, "Michael Segel" <michael_segel@hotmail.com> wrote:
>> >
>> >> Run your weekly job in a low-priority fair scheduler/capacity
>> >> scheduler queue.
>> >>
>> >> Maybe it's just me, but I look at coprocessors as a structure
>> >> similar to RDBMS triggers and stored procedures. You need to show
>> >> restraint and use them sparingly, otherwise you end up creating
>> >> performance issues.
>> >>
>> >> Just IMHO.
>> >>
>> >> -Mike
>> >>
>> >> On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
>> >> <jean-marc@spaggiari.org> wrote:
>> >>
>> >>> I don't have any concern about the time it's taking. It's more
>> >>> about the load it's putting on the cluster. I have other jobs
>> >>> that I need to run (secondary index, data processing, etc.). So
>> >>> the more time this new job takes, the less CPU the others will
>> >>> have.
>> >>>
>> >>> I tried the M/R and I really liked the way it's done. So my only
>> >>> concern is really the performance of the delete part.
>> >>>
>> >>> That's why I'm wondering what's the best practice to move a row to
>> >>> another table.
>> >>>
>> >>> 2012/10/17, Michael Segel <michael_segel@hotmail.com>:
>> >>>> If you're going to be running this weekly, I would suggest that
>> >>>> you stick with the M/R job.
>> >>>>
>> >>>> Is there any reason why you need to be worried about the time it
>> >>>> takes to do the deletes?
>> >>>>
>> >>>>
>> >>>> On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
>> >>>> <jean-marc@spaggiari.org>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Mike,
>> >>>>>
>> >>>>> I'm expecting to run the job weekly. I initially thought about
>> >>>>> using endpoints because I found HBASE-6942, which was a good
>> >>>>> example for my needs.
>> >>>>>
>> >>>>> I'm fine with the Put part of the Map/Reduce, but I'm not sure
>> >>>>> about the delete. That's why I looked at coprocessors. Then I
>> >>>>> figured that I can also do the Put on the coprocessor side.
>> >>>>>
>> >>>>> In an M/R, can I delete the row I'm dealing with based on some
>> >>>>> criteria like timestamp? If I do that, I will not do bulk
>> >>>>> deletes; I will delete the rows one by one, right? Which might
>> >>>>> be very slow.
>> >>>>>
>> >>>>> If in the future I want to run the job daily, might that be an
>> >>>>> issue?
>> >>>>>
>> >>>>> Or should I go with the initial idea of doing the Put with the
>> >>>>> M/R job and the delete with HBASE-6942?
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> JM
>> >>>>>
>> >>>>>
>> >>>>> 2012/10/17, Michael Segel <michael_segel@hotmail.com>:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm a firm believer in KISS (Keep It Simple, Stupid).
>> >>>>>>
>> >>>>>> The Map/Reduce (map job only) is the simplest and least prone
>> >>>>>> to failure.
>> >>>>>>
>> >>>>>> Not sure why you would want to do this using coprocessors.
>> >>>>>>
>> >>>>>> How often are you running this job? It sounds like it's going
>> >>>>>> to be sporadic.
>> >>>>>>
>> >>>>>> -Mike
>> >>>>>>
>> >>>>>> On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
>> >>>>>> <jean-marc@spaggiari.org>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> Can someone please help me to understand the pros and cons
>> >>>>>>> between those 2 options for the following use case?
>> >>>>>>>
>> >>>>>>> I need to transfer all the rows between 2 timestamps to
>> >>>>>>> another table.
>> >>>>>>>
>> >>>>>>> My first idea was to run a MapReduce to map the rows and
>> >>>>>>> store them in another table, and then delete them using an
>> >>>>>>> endpoint coprocessor. But the more I look into it, the more
>> >>>>>>> I think the MapReduce is not a good idea and I should use a
>> >>>>>>> coprocessor instead.
>> >>>>>>>
>> >>>>>>> BUT... The MapReduce framework guarantees me that it will
>> >>>>>>> run against all the regions. I tried to stop a region server
>> >>>>>>> while the job was running. The region moved, and the
>> >>>>>>> MapReduce restarted the task from the new location. Will the
>> >>>>>>> coprocessor do the same thing?
>> >>>>>>>
>> >>>>>>> Also, I found the web console for the MapReduce with the
>> >>>>>>> number of jobs, the status, etc. Is there the same thing for
>> >>>>>>> the coprocessors?
>> >>>>>>>
>> >>>>>>> Are all coprocessors running at the same time on all
>> >>>>>>> regions, which means we can have 100 of them running on a
>> >>>>>>> region server at a time? Or do they run like the MapReduce
>> >>>>>>> tasks, based on some configured values?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> JM
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>
>
>
> --
> Have a Nice Day!
> Lohit
>
