mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Schilling <>
Subject Re: Map/Reduce algorithm needed- histogram
Date Tue, 08 Mar 2011 06:10:53 GMT
What about running terrasort map/reduce?  I am not sure on the run-time complexity differences,
but if you want to find the median of a large set, it gets the job done

On Mar 7, 2011, at 9:55 PM, Lance Norskog wrote:

> And now for some real machine learning. I would like to make a
> histogram of values in a dataset. For simplicity let's just find the
> median.
> I would like to find the median value in a dataset via Map/Reduce. The
> basic idea is distributed error estimation and reduction:
>    start with a guess for the median.
>    map each value to one side of the median.
>    when the mapper finishes, also emit metadata to all reducers.
>        this metadata is the largest or smallest value in this sample.
>    the receiver gets the metadata from all mappers for the 'guessed' median.
>    It also takes the boundary for the items it receives. It then
> combines all of the boundary values into a new guess of the median.
>    The job driver combines all of the reducer's boundaries and
> creates a new composite guess.
>    And repeat with some end condition. One can demand complete
> accuracy, or a percentage of misclassified items.
> --------------------------------------------------
> More detail
> This is an iterative design. Between each pass, there is a 'current
> median' value passed as metadata. At the start, pick an arbitrary
> value.
> Key/value pair: ('upper'/'lower', (purpose='value'/boundary, item)
> In the key/value pair, there is a special boolean/enum type called
> 'purpose=value or boundary'. This is one of two values: an entry to be
> classified, or metadata giving the boundary values that the mapper
> found and emitted.
>    Start of map phase:
>        Start with 'current median'.
>    Mapper:
>        The key is a boolean
>        However, 4 kinds of output.
>        key: boolean type: < current median or >= current median
>        value: (boolean purpose, number value)
>        loop(number):
>            map number to upper/lower bin for key
>            emit (key, (purpose = data, number) )
>            track maximum & minimum values
>        end
>        emit (! key, (purpose = boundary, maximum))
>    end Mapper
>    Reducer:
>        Receives all (key = boolean, (purpose = boundary, number)) first.
>        This will receive the boundary values for the other bin,
> because the mapper wrote the boundary to !('upper' v.s. 'lower')
>        Calculates the mean of the boundary values as the new current median.
>        Receives and assigns all (purpose = number) tuples.
> There is a problem with this: the Reducer has to cache all of the
> value tuples until it gets all of the boundary tuples. Thus it is
> memory-bound. The Mapper is not memory-bound.
> A hack to fix this is to create a custom sorter that causes each
> reducer to receive the boundary tuples first.
> -- 
> Lance Norskog

View raw message