mahout-user mailing list archives

From Lance Norskog <>
Subject Re: Map/Reduce algorithm needed- histogram
Date Tue, 08 Mar 2011 08:31:01 GMT
Really it is to support an arbitrary number of buckets. But thank you;
TeraSort does some interesting things with the Hadoop M/R framework.

One use for this is to find and remove the 5% most and least common
terms from a Lucene index, as part of vectorizing, or making spelling
dictionaries.
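
For reference, a single-machine sketch of that trimming step against a
Lucene 3.x index (the docFreq walk is real API; the 5% cutoffs and the
index path are assumptions). The point of the M/R question is to do this
kind of thing without pulling everything into one process:

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermFrequencyCutoffs {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    List<Integer> freqs = new ArrayList<Integer>();
    TermEnum terms = reader.terms();
    while (terms.next()) {
      freqs.add(terms.docFreq());            // document frequency per term
    }
    terms.close();
    reader.close();
    if (freqs.isEmpty()) return;
    Collections.sort(freqs);
    // terms outside the middle 90% band would be dropped
    int low = freqs.get((int) (freqs.size() * 0.05));
    int high = freqs.get((int) (freqs.size() * 0.95));
    System.out.println("keep terms with docFreq in [" + low + ", " + high + "]");
  }
}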


On Mon, Mar 7, 2011 at 10:10 PM, Chris Schilling <> wrote:
> What about running TeraSort map/reduce?  I am not sure about the run-time complexity differences,
> but if you want to find the median of a large set, it gets the job done.
> On Mar 7, 2011, at 9:55 PM, Lance Norskog wrote:
>> And now for some real machine learning. I would like to make a
>> histogram of values in a dataset. For simplicity let's just find the
>> median.
>> I would like to find the median value in a dataset via Map/Reduce. The
>> basic idea is distributed error estimation and reduction:
>>    start with a guess for the median.
>>    map each value to one side of the median.
>>    when the mapper finishes, also emit metadata to all reducers.
>>        this metadata is the largest or smallest value in this sample.
>>    the reducer gets the metadata from all mappers for the 'guessed' median.
>>    It also tracks the boundary values of the items it receives. It then
>> combines all of the boundary values into a new guess of the median.
>>    The job driver combines all of the reducer's boundaries and
>> creates a new composite guess.
>>    And repeat, with some end condition: one can demand complete
>> accuracy, or allow a percentage of misclassified items. (A toy version
>> of this loop is sketched below.)
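>>
>> A toy, single-process simulation of that loop, with the data split into
>> chunks standing in for mappers (the chunk sizes, 2% tolerance, and pass
>> cap are made up; the mean-of-boundaries combine is as described above):
>>
>> import java.util.Random;
>>
>> public class MedianGuessToy {
>>   public static void main(String[] args) {
>>     Random rnd = new Random(42);
>>     double[][] chunks = new double[50][20];      // 50 fake "mappers"
>>     for (double[] chunk : chunks)
>>       for (int i = 0; i < chunk.length; i++) chunk[i] = rnd.nextGaussian();
>>
>>     double guess = chunks[0][0];                 // arbitrary starting guess
>>     for (int pass = 0; pass < 20; pass++) {      // cap in case it never settles
>>       double boundarySum = 0;
>>       int boundaries = 0;
>>       long lower = 0, upper = 0;
>>       for (double[] chunk : chunks) {
>>         double lowerMax = Double.NEGATIVE_INFINITY;
>>         double upperMin = Double.POSITIVE_INFINITY;
>>         for (double d : chunk) {
>>           if (d < guess) { lower++; lowerMax = Math.max(lowerMax, d); }
>>           else           { upper++; upperMin = Math.min(upperMin, d); }
>>         }
>>         // boundary metadata: this sample's values nearest the guess
>>         if (lowerMax > Double.NEGATIVE_INFINITY) { boundarySum += lowerMax; boundaries++; }
>>         if (upperMin < Double.POSITIVE_INFINITY) { boundarySum += upperMin; boundaries++; }
>>       }
>>       System.out.println("pass " + pass + ": guess " + guess
>>           + ", lower " + lower + ", upper " + upper);
>>       if (Math.abs(lower - upper) <= 0.02 * (lower + upper)) break;
>>       guess = boundarySum / boundaries;          // mean of boundaries = new guess
>>     }
>>   }
>> }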
>> --------------------------------------------------
>> More detail
>> This is an iterative design. Between each pass, there is a 'current
>> median' value passed as metadata. At the start, pick an arbitrary
>> value.
>> Key/value pair: ('upper'|'lower', (purpose = 'value'|'boundary', item))
>> In the key/value pair there is a special boolean/enum field called
>> 'purpose', with two values: the tuple is either an entry to be
>> classified, or metadata giving the boundary value that the mapper
>> found and emitted.
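>>
>> A minimal Writable for that tagged value (the class and field names are
>> placeholders, not anything existing in Mahout):
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Writable;
>>
>> public class TaggedValue implements Writable {
>>   public boolean boundary;  // purpose: false = value to classify, true = boundary metadata
>>   public double value;
>>
>>   public TaggedValue() {}   // no-arg constructor required by Hadoop
>>
>>   public TaggedValue(boolean boundary, double value) {
>>     this.boundary = boundary;
>>     this.value = value;
>>   }
>>
>>   public void write(DataOutput out) throws IOException {
>>     out.writeBoolean(boundary);
>>     out.writeDouble(value);
>>   }
>>
>>   public void readFields(DataInput in) throws IOException {
>>     boundary = in.readBoolean();
>>     value = in.readDouble();
>>   }
>> }
>>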
>>    Start of map phase:
>>        Start with 'current median'.
>>    Mapper:
>>        The key is a boolean, but there are 4 kinds of output
>> (2 keys x 2 purposes).
>>        key: boolean type: < current median or >= current median
>>        value: (boolean purpose, number value)
>>        loop(number):
>>            map number to upper/lower bin for key
>>            emit (key, (purpose = data, number) )
>>            track maximum & minimum values
>>        end
>>        emit (! key, (purpose = boundary, boundary value)): the lower
>> bin's maximum and the upper bin's minimum, each sent to the opposite key
>>    end Mapper
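>>
>> In the 0.20 mapreduce API, using the TaggedValue Writable above, the
>> mapper might look like this (the config key carrying the current median
>> between passes is an assumption):
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.BooleanWritable;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>>
>> public class MedianMapper
>>     extends Mapper<LongWritable, Text, BooleanWritable, TaggedValue> {
>>   private double median;     // 'current median' from the previous pass
>>   private double lowerMax = Double.NEGATIVE_INFINITY;
>>   private double upperMin = Double.POSITIVE_INFINITY;
>>
>>   protected void setup(Context context) {
>>     median = context.getConfiguration().getFloat("median.guess", 0f);
>>   }
>>
>>   protected void map(LongWritable offset, Text line, Context context)
>>       throws IOException, InterruptedException {
>>     double d = Double.parseDouble(line.toString());
>>     boolean upper = d >= median;                  // which bin
>>     context.write(new BooleanWritable(upper), new TaggedValue(false, d));
>>     if (upper) upperMin = Math.min(upperMin, d);  // track boundaries
>>     else       lowerMax = Math.max(lowerMax, d);
>>   }
>>
>>   protected void cleanup(Context context)
>>       throws IOException, InterruptedException {
>>     // each boundary goes to the *other* bin's reducer
>>     if (lowerMax > Double.NEGATIVE_INFINITY)
>>       context.write(new BooleanWritable(true), new TaggedValue(true, lowerMax));
>>     if (upperMin < Double.POSITIVE_INFINITY)
>>       context.write(new BooleanWritable(false), new TaggedValue(true, upperMin));
>>   }
>> }
>>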
>>    Reducer:
>>        Receives all (key = boolean, (purpose = boundary, number)) tuples first.
>>        These are the boundary values for the other bin, because the
>> mapper wrote each boundary to !('upper' vs. 'lower').
>>        Calculates the mean of the boundary values as the new current median.
>>        Receives and reassigns all (purpose = data, number) tuples.
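>>
>> A sketch of that reducer, which makes the caching problem described next
>> concrete (the output types are arbitrary choices):
>>
>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.hadoop.io.BooleanWritable;
>> import org.apache.hadoop.io.DoubleWritable;
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> public class MedianReducer
>>     extends Reducer<BooleanWritable, TaggedValue, BooleanWritable, DoubleWritable> {
>>   protected void reduce(BooleanWritable bin, Iterable<TaggedValue> values,
>>       Context context) throws IOException, InterruptedException {
>>     double boundarySum = 0;
>>     int boundaries = 0;
>>     List<Double> cached = new ArrayList<Double>();
>>     for (TaggedValue v : values) {
>>       if (v.boundary) { boundarySum += v.value; boundaries++; }
>>       else cached.add(v.value);   // must hold data until all boundaries seen
>>     }
>>     double newMedian = boundarySum / boundaries;  // mean of the boundaries
>>     for (double d : cached)                       // reassign cached items
>>       context.write(new BooleanWritable(d >= newMedian), new DoubleWritable(d));
>>     // the new median itself goes back to the driver as metadata, e.g.
>>     // via a counter or a side file (not shown)
>>   }
>> }
>>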
>> There is a problem with this: the Reducer has to cache all of the
>> value tuples until it gets all of the boundary tuples. Thus it is
>> memory-bound. The Mapper is not memory-bound.
>> A hack to fix this is to create a custom sorter that causes each
>> reducer to receive the boundary tuples first.
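>>
>> The usual shape of that hack is Hadoop's secondary sort: fold the purpose
>> flag into the map output key, let the sort order put boundary tuples
>> first, but partition and group on the bin alone so one reduce() call
>> still sees everything for its bin. A sketch:
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.WritableComparable;
>> import org.apache.hadoop.io.WritableComparator;
>> import org.apache.hadoop.mapreduce.Partitioner;
>>
>> public class BinPurposeKey implements WritableComparable<BinPurposeKey> {
>>   public boolean bin;       // false = lower, true = upper
>>   public boolean boundary;  // boundary tuples sort ahead of data tuples
>>
>>   public void write(DataOutput out) throws IOException {
>>     out.writeBoolean(bin);
>>     out.writeBoolean(boundary);
>>   }
>>   public void readFields(DataInput in) throws IOException {
>>     bin = in.readBoolean();
>>     boundary = in.readBoolean();
>>   }
>>   public int compareTo(BinPurposeKey o) {
>>     if (bin != o.bin) return bin ? 1 : -1;
>>     if (boundary != o.boundary) return boundary ? -1 : 1;  // boundaries first
>>     return 0;
>>   }
>>
>>   // route both purposes for a bin to the same reducer
>>   public static class BinPartitioner extends Partitioner<BinPurposeKey, Object> {
>>     public int getPartition(BinPurposeKey key, Object value, int numPartitions) {
>>       return (key.bin ? 1 : 0) % numPartitions;
>>     }
>>   }
>>
>>   // group on bin alone, so boundaries and data share one reduce() call
>>   public static class BinGrouper extends WritableComparator {
>>     protected BinGrouper() { super(BinPurposeKey.class, true); }
>>     public int compare(WritableComparable a, WritableComparable b) {
>>       boolean x = ((BinPurposeKey) a).bin, y = ((BinPurposeKey) b).bin;
>>       return x == y ? 0 : (x ? 1 : -1);
>>     }
>>   }
>> }
>>
>> Wired up with job.setPartitionerClass(BinPurposeKey.BinPartitioner.class)
>> and job.setGroupingComparatorClass(BinPurposeKey.BinGrouper.class); the
>> default sort then falls back to compareTo() above.
>>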
>> --
>> Lance Norskog

Lance Norskog
