And now for some real machine learning. I would like to make a
histogram of values in a dataset; for simplicity, let's just find the
median. The goal is to find the median value of a dataset via
Map/Reduce. The basic idea is distributed error estimation and
reduction:
- Start with a guess for the median.
- Map each value to one side of the guessed median.
- When a mapper finishes, it also emits metadata to the reducers:
  the largest or smallest value in its sample.
- Each reducer receives the metadata from all mappers for the
  'guessed' median, along with the boundary for the items it
  receives, and combines all of the boundary values into a new
  guess of the median.
- The job driver combines all of the reducers' boundaries and
  creates a new composite guess.
- Repeat until some end condition is met. One can demand complete
  accuracy, or tolerate a percentage of misclassified items.
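
The steps above can be sketched as a single-process simulation of one
pass. This is only an illustration of the idea, not a Hadoop program;
the function name one_pass and the 'lower'/'upper' key strings are
made up for the sketch, and the shuffle is just a dict:

```python
def one_pass(values, guess):
    """Simulate one map/reduce pass of the median search.

    Returns the next median guess plus the size of each bin,
    which a driver could use for its end condition.
    """
    # --- map phase: bin each value against the current guess ---
    shuffle = {"lower": [], "upper": []}
    for v in values:
        key = "lower" if v < guess else "upper"
        shuffle[key].append(("data", v))
    lows = [v for _, v in shuffle["lower"]]
    highs = [v for _, v in shuffle["upper"]]
    # boundary metadata is routed to the *opposite* bin's reducer
    if lows:
        shuffle["upper"].append(("boundary", max(lows)))
    if highs:
        shuffle["lower"].append(("boundary", min(highs)))
    # --- reduce/driver phase: combine boundaries into the next guess ---
    boundaries = [v for recs in shuffle.values()
                  for p, v in recs if p == "boundary"]
    new_guess = sum(boundaries) / len(boundaries)
    counts = {k: sum(1 for p, _ in recs if p == "data")
              for k, recs in shuffle.items()}
    return new_guess, counts
```

A driver would call this repeatedly, feeding each new guess back in
until the two bin counts are balanced to within its tolerance.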

More detail
This is an iterative design. Between passes, a 'current median'
value is passed along as metadata. At the start, pick an arbitrary
value.
Key/value pair: ('upper' | 'lower', (purpose = 'value' | 'boundary', item))
In the value part of the pair, there is a special boolean/enum field
called 'purpose'. It takes one of two values: the item is an entry to
be classified, or it is metadata giving a boundary value that the
mapper found and emitted.
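
In Python terms, the two record shapes might look like this (the
Purpose enum and the record names are illustrative only):

```python
from enum import Enum

class Purpose(Enum):
    DATA = "value"         # an entry to be classified
    BOUNDARY = "boundary"  # bin-edge metadata from a mapper

# key is 'lower' or 'upper'; value is a (purpose, number) pair
data_record = ("lower", (Purpose.DATA, 42.0))
boundary_record = ("upper", (Purpose.BOUNDARY, 41.5))
```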
Start of map phase:
Start with the 'current median'.
Mapper:
  The key is a boolean, but there are 4 kinds of output
  (2 keys x 2 purposes).
  key: boolean: < current median ('lower') or >= current median ('upper')
  value: (boolean purpose, number value)
  loop(number):
    map number to the upper or lower bin; this is the key
    emit (key, (purpose = data, number))
    track the maximum of the lower bin and the minimum of the upper bin
  end
  emit ('upper', (purpose = boundary, lower-bin maximum))
  emit ('lower', (purpose = boundary, upper-bin minimum))
end Mapper
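
A direct Python rendering of this pseudocode might look like the
generator below. This is a sketch, not the Hadoop Mapper API; the
name mapper and the 'lower'/'upper' strings are placeholders. It
routes each bin's boundary to the opposite key, as the `! key`
emission above does:

```python
def mapper(numbers, current_median):
    """Yield (key, (purpose, number)) pairs for one mapper's sample."""
    lower_max = None
    upper_min = None
    for n in numbers:
        key = "lower" if n < current_median else "upper"
        yield (key, ("data", n))  # an entry to be classified
        if key == "lower":
            lower_max = n if lower_max is None else max(lower_max, n)
        else:
            upper_min = n if upper_min is None else min(upper_min, n)
    # boundary metadata, one record per non-empty bin,
    # sent to the other bin's reducer
    if lower_max is not None:
        yield ("upper", ("boundary", lower_max))
    if upper_min is not None:
        yield ("lower", ("boundary", upper_min))
```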
Reducer:
Logically, it needs all (key = boolean, (purpose = boundary, number))
tuples first. These carry the boundary values for the other bin,
because each mapper wrote its boundary to the opposite key
('upper' vs. 'lower'). The reducer calculates the mean of the
boundary values as the new current median, then receives and
classifies all (purpose = data) tuples.
There is a problem with this: tuples arrive in no guaranteed order,
so the Reducer has to cache all of the value tuples until it has
seen all of the boundary tuples. Thus it is memory-bound. The Mapper
is not memory-bound.
A hack to fix this is to create a custom sorter that causes each
reducer to receive the boundary tuples first.
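
With that sort order in place, the reducer can run in constant
memory. The sketch below (reduce_stream is an invented name, not a
Hadoop API) assumes every ('boundary', v) record arrives before any
('data', v) record, so the new guess is fixed before the data
streams past and nothing has to be cached:

```python
from itertools import chain

def reduce_stream(records):
    """One reducer's pass, relying on boundary-first sort order."""
    records = iter(records)
    boundaries = []
    pending = []
    for purpose, value in records:
        if purpose != "boundary":
            # first data record: all boundaries have been seen
            pending.append((purpose, value))
            break
        boundaries.append(value)
    new_guess = sum(boundaries) / len(boundaries)
    data_count = 0
    for _ in chain(pending, records):
        data_count += 1  # classify/count data values as they stream by
    return new_guess, data_count
```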

Lance Norskog
goksron@gmail.com
