mahout-user mailing list archives

From Lance Norskog <>
Subject Map/Reduce algorithm needed- histogram
Date Tue, 08 Mar 2011 05:55:57 GMT
And now for some real machine learning. I would like to make a
histogram of values in a dataset. For simplicity, let's just find the
median.

I would like to find the median value in a dataset via Map/Reduce. The
basic idea is distributed error estimation and reduction:
    start with a guess for the median.
    map each value to one side of the median.
    when the mapper finishes, also emit metadata to all reducers.
        this metadata is the largest or smallest value in this sample.

    the reducer gets the metadata from all mappers for the 'guessed' median.
    It also takes the boundary for the items it receives. It then
combines all of the boundary values into a new guess of the median.
    The job driver combines all of the reducers' boundaries and
creates a new composite guess.
    Repeat with some end condition: one can demand complete
accuracy, or tolerate a percentage of misclassified items.
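
The whole loop can be sketched in a few lines of single-process
Python. The shard handling, names, and tolerance knob are mine, not
Mahout/Hadoop API; and note that because the boundary values sit
right next to the current guess, the guess can move slowly, so the
iteration cap is what guarantees termination.

```python
def approx_median(data, num_shards=4, tol=0.01, max_iters=100):
    """Iteratively refine a guess at the median, as sketched above.

    Each pass: every shard (a stand-in for one mapper's input)
    splits its items around the current guess and reports its
    boundary values; the driver averages the boundaries into the
    next guess.  Stops when the two sides are balanced to within
    tol * len(data) items, or after max_iters passes.
    """
    shards = [data[i::num_shards] for i in range(num_shards)]
    guess = data[0]                      # arbitrary starting guess
    for _ in range(max_iters):
        n_lower = n_upper = 0
        boundaries = []
        for shard in shards:
            lower = [x for x in shard if x < guess]
            upper = [x for x in shard if x >= guess]
            n_lower += len(lower)
            n_upper += len(upper)
            # boundary metadata: the extreme value on each side
            if lower:
                boundaries.append(max(lower))
            if upper:
                boundaries.append(min(upper))
        # end condition: tolerated fraction of misclassified items
        if abs(n_lower - n_upper) <= tol * len(data):
            return guess
        # new composite guess from all the boundary values
        guess = sum(boundaries) / len(boundaries)
    return guess
```

Since every new guess is an average of actual data values, the
result always lies inside the data's range, even when the loop hits
the iteration cap before the split is balanced.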

More detail

This is an iterative design. Between each pass, there is a 'current
median' value passed as metadata. At the start, pick an arbitrary value.
Key/value pair: ('upper'|'lower', (purpose='data'|'boundary', item))
In the key/value pair, there is a special boolean/enum field called
'purpose'. It takes one of two values: 'data' marks an entry to be
classified, and 'boundary' marks metadata giving the boundary values
that the mapper found and emitted.

    Start of map phase:
        Start with the 'current median'.
        The key is a boolean, but there are 4 kinds of output in all
(data and boundary tuples for each side).
        key: boolean type: < current median or >= current median
        value: (boolean purpose, number value)
            map each number to the upper/lower bin for the key
            emit (key, (purpose = data, number) )
            track the maximum & minimum values seen
        emit (! key, (purpose = boundary, maximum or minimum)): the
lower bin's maximum goes to the 'upper' key, and the upper bin's
minimum goes to the 'lower' key.
    end Mapper
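
The map phase above, written out as a plain-Python sketch (the
function name is mine and the 'data'/'boundary' tags follow the
pseudocode; this is a stand-alone illustration, not the Hadoop
Mapper API):

```python
def run_mapper(shard, current_median):
    """Emit (key, (purpose, number)) tuples for one mapper's input.

    key is True for the upper bin (>= current median) and False for
    the lower bin.  After classifying every item, each side's
    boundary value is emitted under the OPPOSITE key, so the other
    reducer can refine the guess.
    """
    out = []
    lower_max = upper_min = None
    for x in shard:
        key = x >= current_median            # upper bin?
        out.append((key, ('data', x)))
        if key:
            upper_min = x if upper_min is None else min(upper_min, x)
        else:
            lower_max = x if lower_max is None else max(lower_max, x)
    # boundary metadata goes to the other bin (!key)
    if lower_max is not None:
        out.append((True, ('boundary', lower_max)))
    if upper_min is not None:
        out.append((False, ('boundary', upper_min)))
    return out
```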

    Start of reduce phase:
        Receives all (key = boolean, (purpose = boundary, number)) tuples first.
        These are the boundary values for the other bin, because each
mapper wrote its boundary to !('upper' vs. 'lower').
        Calculates the mean of the boundary values as the new current median.
        Receives and assigns all (purpose = data) tuples.
    end Reducer
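
A matching sketch of the reduce phase, assuming the boundary tuples
really do arrive first (names are mine; the assert marks exactly
where that assumption bites):

```python
def run_reducer(values):
    """Reduce one key's stream of (purpose, number) tuples.

    Expects every ('boundary', n) tuple before the first ('data', n)
    tuple.  The new median is fixed the moment data starts arriving,
    so data items never need to be cached in memory.
    """
    boundaries = []
    new_median = None
    assigned = []
    for purpose, number in values:
        if purpose == 'boundary':
            assert new_median is None, "boundary tuple arrived after data"
            boundaries.append(number)
        else:
            if new_median is None:       # first data tuple: fix the guess
                new_median = sum(boundaries) / len(boundaries)
            assigned.append(number)
    if new_median is None and boundaries:
        new_median = sum(boundaries) / len(boundaries)
    return new_median, assigned
```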

There is a problem with this: the Reducer has to cache all of the
value tuples until it gets all of the boundary tuples. Thus it is
memory-bound. The Mapper is not memory-bound.

A hack to fix this is to create a custom sorter that causes each
reducer to receive the boundary tuples first.
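
In Hadoop terms this is a secondary sort: fold the purpose flag into
the sort order, e.g. with a composite key plus a custom comparator
set via Job.setSortComparatorClass, so that within each key the
boundary tuples sort ahead of the data tuples. The ordering itself,
sketched in plain Python:

```python
def shuffle_sort(tuples):
    """Order one reducer's input so boundary tuples arrive first.

    Sorts by (key, is_data): within each key, 'boundary' (False)
    precedes 'data' (True), mirroring what a custom sort comparator
    on a composite key would do during the shuffle.
    """
    return sorted(tuples, key=lambda kv: (kv[0], kv[1][0] != 'boundary'))
```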

Lance Norskog
