mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Map/Reduce algorithm needed- histogram
Date Tue, 08 Mar 2011 07:23:03 GMT
You don't need either sort or iterations to get the median.

Check out the OnlineSummarizer for a sequential online O(n) algorithm for
median estimation that is about as accurate as the sort-based approach.  You can
linearly combine multiple median estimates if you have suitably randomized
data.
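Mahout's OnlineSummarizer (in the mahout-math module) is the real implementation; as a rough sketch of how a single-pass, O(1)-memory median estimate can work, here is a stochastic-approximation estimator. The step-size schedule and the test data below are illustrative assumptions, not Mahout's actual algorithm:

```python
import random

def streaming_median(stream, eta=1.0):
    """Single-pass median estimate: nudge the running estimate toward
    each new value by a shrinking step.  O(n) time, O(1) memory."""
    m = None
    for i, x in enumerate(stream, start=1):
        if m is None:
            m = float(x)              # initialize with the first value
            continue
        step = eta / (i ** 0.5)       # shrinking step size (assumed schedule)
        if x > m:
            m += step
        elif x < m:
            m -= step
    return m

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(100_000)]
est = streaming_median(data)
print(est)   # close to 0.0, the true median of a standard normal
```

Because each update depends only on the sign of (x - m), independent estimates from randomized shards can be averaged, which is the linear-combination property mentioned above.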

On Mon, Mar 7, 2011 at 10:10 PM, Chris Schilling <chris@cellixis.com> wrote:

> What about running a TeraSort-style map/reduce?  I am not sure about the
> run-time complexity differences, but if you want to find the median of a
> large set, it gets the job done
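For comparison, the baseline a full sort buys: once the data is globally sorted (which is what a TeraSort-style job produces), the median is just the middle element, at O(n log n) total cost. A minimal sequential sketch:

```python
def exact_median(data):
    """Exact median via a full sort: after sorting, the median is the
    middle element (or the mean of the two middle elements)."""
    s = sorted(data)
    n = len(s)
    if n % 2:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(exact_median([7, 1, 5, 3, 9]))   # 5
print(exact_median([4, 2, 8, 6]))      # 5.0
```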
>
>
> On Mar 7, 2011, at 9:55 PM, Lance Norskog wrote:
>
> > And now for some real machine learning. I would like to make a
> > histogram of values in a dataset. For simplicity let's just find the
> > median.
> >
> > I would like to find the median value in a dataset via Map/Reduce. The
> > basic idea is distributed error estimation and reduction:
> >    start with a guess for the median.
> >    map each value to one side of the median.
> >    when the mapper finishes, also emit metadata to all reducers.
> >        this metadata is the largest or smallest value in this sample.
> >
> >    the reducer gets the metadata from all mappers for the 'guessed'
> median.
> >    It also takes the boundary for the items it receives. It then
> > combines all of the boundary values into a new guess of the median.
> >    The job driver combines all of the reducers' boundaries and
> > creates a new composite guess.
> >    Repeat until an end condition is met. One can demand complete
> > accuracy, or tolerate a percentage of misclassified items.
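One way to make the outer loop above provably terminate is to refine the guess from per-side *counts* rather than from boundary values: a bisection on the value range, where each counting pass plays the role of one map/reduce job. This count-based rule is a swapped-in variant, not the boundary rule from the post; a sequential sketch:

```python
def kth_smallest(data, k):
    """k-th smallest element (0-based) by bisection on the value range.
    Each iteration needs only one counting pass over the data, which is
    exactly what a single map/reduce job can supply."""
    lo, hi = min(data), max(data)
    while lo < hi:
        mid = (lo + hi) / 2
        if sum(1 for x in data if x <= mid) >= k + 1:
            # the k-th smallest is <= mid: snap hi down to a real value
            hi = max(x for x in data if x <= mid)
        else:
            # the k-th smallest is > mid: snap lo up to a real value
            lo = min(x for x in data if x > mid)
    return lo

def median(data):
    return kth_smallest(data, len(data) // 2)   # upper median for even n

print(median([7, 1, 5, 3, 9]))   # 5
```

Snapping lo and hi to actual data values after each pass guarantees strict progress, so the number of passes is bounded by the number of distinct values (and in practice is logarithmic in the value range).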
> >
> > --------------------------------------------------
> > More detail
> >
> > This is an iterative design. Between each pass, there is a 'current
> > median' value passed as metadata. At the start, pick an arbitrary
> > value.
> > Key/value pair: (key = 'upper' | 'lower', value = (purpose = data | boundary, item))
> > In the value, 'purpose' is a two-valued enum marking what the tuple
> > carries: 'data' for an entry to be classified, or 'boundary' for
> > metadata giving a boundary value that the mapper found and emitted.
> >
> >    Start of map phase:
> >        Start with 'current median'.
> >    Mapper:
> >        The key is a boolean: 'lower' (< current median) or 'upper' (>= current median)
> >        With two purposes per key, there are 4 kinds of output.
> >        value: (boolean purpose, number value)
> >        loop(number):
> >            key = number < current median ? 'lower' : 'upper'
> >            emit (key, (purpose = data, number) )
> >            track the maximum of the 'lower' bin and the minimum of the 'upper' bin
> >        end
> >        emit ('upper', (purpose = boundary, maximum of 'lower' bin))
> >        emit ('lower', (purpose = boundary, minimum of 'upper' bin))
> >    end Mapper
> >
> >    Reducer:
> >        Receives all (key, (purpose = boundary, number)) tuples first.
> >        These are the boundary values for the other bin, because the
> > mapper wrote each boundary to the opposite key ('upper' vs. 'lower').
> >        Calculates the mean of the boundary values as the new current
> median.
> >        Then receives and assigns all (purpose = data, number) tuples.
> >
> > There is a problem with this: the Reducer has to cache all of the
> > value tuples until it gets all of the boundary tuples. Thus it is
> > memory-bound. The Mapper is not memory-bound.
> >
> > A fix is a secondary sort: a custom sort comparator that causes each
> > reducer to receive the boundary tuples before the data tuples.
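The steps above can be simulated in a few lines. This is a toy in-memory shuffle, not Hadoop code; the composite (purpose, value) sort with BOUNDARY < DATA stands in for the custom sorter the post asks for:

```python
from collections import defaultdict

# Boundary tuples sort before data tuples -- the in-memory stand-in
# for the custom sorter.
BOUNDARY, DATA = 0, 1

def map_phase(chunk, median):
    """One mapper: bin values against the current median guess, and
    emit each bin's boundary to the *opposite* key."""
    lower = [x for x in chunk if x < median]
    upper = [x for x in chunk if x >= median]
    out = [('lower', (DATA, x)) for x in lower]
    out += [('upper', (DATA, x)) for x in upper]
    if lower:
        out.append(('upper', (BOUNDARY, max(lower))))
    if upper:
        out.append(('lower', (BOUNDARY, min(upper))))
    return out

def reduce_phase(pairs):
    """Shuffle, sort, reduce: sorting the (purpose, value) tuples puts
    BOUNDARY first, so each reducer can compute its new guess before
    streaming the DATA tuples."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    guesses = {}
    for key, values in groups.items():
        values.sort()                                  # boundaries first
        bounds = [v for p, v in values if p == BOUNDARY]
        guesses[key] = sum(bounds) / len(bounds)
    return guesses

# One iteration over three mapper chunks with a starting guess of 4.
chunks = [[1, 9, 4], [7, 3, 8], [2, 6, 5]]
pairs = [kv for chunk in chunks for kv in map_phase(chunk, median=4)]
guesses = reduce_phase(pairs)
new_median = sum(guesses.values()) / len(guesses)      # driver's combine step
print(guesses, new_median)
```

Whether this boundary-averaging refinement actually converges to the true median is the open question in the thread; the sketch only demonstrates the data flow and the boundary-first ordering.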
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
>
