hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: setNumReduceTasks(1)
Date Sat, 30 Jan 2010 09:35:35 GMT

Top K is slightly more complicated (in comparison) to implement 
efficiently : you might want to look at other projects like pig to see 
how they do it (to compare and look at ideas).

Just to get an understanding - your mappers generate <key, value>, and 
you want to pick top K based on value in reducer side ?
Or can you have multiple key's coming in from various mappers and you 
need to aggregate it at reducer ?


If former (that is key is unique), then a combiner to emit's top K per 
mapper, and then a single reducer which sorts and picks from the M * C * 
K tuples should do the trick (M == number of mappers, C == avg number of 
combiner invocations per mapper, K == number of output tuples required).

If latter, you can try to do heuristics to approximate the value, but it 
always has a error margin (to efficiently do it : this is something I 
ask in interviews :) ) which you will need to take into account - or you 
can just split it into two jobs : aggregate in job 1, top K in job 2.

Regards,
Mridul


Something Something wrote:
> N could be up to 1000, and output from Map job could be about 5 Million.  We
> only want the top 1000 because rest of it could be just noise.  Thanks for
> your help.
> 
> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <alex.baranov.v@gmail.com>wrote:
> 
>> How big is N?  How big is outcome of Map job?
>>
>> Alex.
>>
>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> I am sorry, but I forgot to add one important piece of information.
>>>
>>> I don't want to write any random N rows to the table.  I want to write
>> the
>>> *top* N rows - meaning - I want to write the "key" values of the Reducer
>> in
>>> descending order.  Does this make sense?  Sorry for the confusion.
>>>
>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
>>> mridulm@yahoo-inc.com
>>>> wrote:
>>>> A possible solution is to emit only N rows from each mapper and then
>> use
>>> 1
>>>> reduce task [*] - if value of N is not very high.
>>>> So you end up with utmost m * N rows on reducer instead of full
>> inputset
>>> -
>>>> and so the limit can be done easier.
>>>>
>>>>
>>>> If you ok with some sort of variance in the number of rows inserted
>> (and
>>> if
>>>> value of N is very high), you can do more interesting things like N/m'
>>> rows
>>>> per mapper - and multiple reducers (r) : with assumtion that each
>> reducer
>>>> will see atleast N/r rows - and so you can limit to N/r per reducer :
>>>> ofcourse, there is a possible error that gets introduced here ...
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> [*] Assuming you just want simple limit - nothing else.
>>>> Also note, each mapper might want to emit N rows instead of 'tweaks'
>> like
>>>> N/m rows, since it is possible that multiple mappers might have less
>> than
>>>> N/m rows to emit to begin with !
>>>>
>>>>
>>>>
>>>> Something Something wrote:
>>>>
>>>>> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the
>>>>> class
>>>>> be instantiated only on one machine.. always?  I mean if I have a
>>> cluster
>>>>> of
>>>>> say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>> guaranteed
>>>>> to
>>>>> be instantiated only on 1 machine?
>>>>>
>>>>> If answer is yes, then I will use static variable as a counter to see
>>> how
>>>>> may rows have been added to my HBase table so far.  In my use case, I
>>> want
>>>>> to write only N number of rows to a table.  Is there a better way to
>> do
>>>>> this?  Please let me know.  Thanks.
>>>>>
>>>>


Mime
View raw message