hadoop-mapreduce-dev mailing list archives

From "Craig Macdonald (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-4776) Reducer Channels
Date Wed, 07 Nov 2012 12:51:12 GMT
Craig Macdonald created MAPREDUCE-4776:

             Summary: Reducer Channels
                 Key: MAPREDUCE-4776
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4776
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Craig Macdonald

A 2009 Google paper on LDA -- which can be found at http://plda.googlecode.com/files/aaim.pdf
-- describes what it terms "reducer channels". These are similar to MultipleOutputs, except that
the collect() call in the map task names a set of reducers, and the key/value pairs are
forwarded to that set of reducers. This also implies separate combiners and partitioning
for each reduce channel.
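
To make the idea concrete, here is a minimal in-memory sketch of what a channel-aware collect() might look like. It is not a Hadoop API and not taken from the paper: the class ChannelCollector and the methods declareChannel(), collect() and pending() are hypothetical names invented for illustration, and keys and values are simplified to String and int. Each channel gets its own combiner, which is applied as values for the same key arrive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** Hypothetical in-memory model of a channel-aware collector; NOT a Hadoop API. */
final class ChannelCollector {
    // One combiner per channel (assumption: a combiner is an associative merge of two values).
    private final Map<String, BinaryOperator<Integer>> combiners = new HashMap<>();
    // One buffer of combined key/value pairs per channel.
    private final Map<String, Map<String, Integer>> buffers = new HashMap<>();

    /** Registers a named channel together with its own combiner. */
    void declareChannel(String channel, BinaryOperator<Integer> combiner) {
        combiners.put(channel, combiner);
        buffers.put(channel, new TreeMap<>());
    }

    /** Emits a key/value pair to the named channel, combining with any earlier value. */
    void collect(String channel, String key, int value) {
        bufferFor(channel).merge(key, value, combiners.get(channel));
    }

    /** Returns the combined pairs waiting to be sent to the channel's reducers. */
    Map<String, Integer> pending(String channel) {
        return Collections.unmodifiableMap(bufferFor(channel));
    }

    private Map<String, Integer> bufferFor(String channel) {
        Map<String, Integer> buffer = buffers.get(channel);
        if (buffer == null) {
            throw new IllegalArgumentException("Undeclared channel: " + channel);
        }
        return buffer;
    }
}
```

A map task would then call, for example, collect("data", key, value) and collect("model", key, value), and the two streams would be combined and reduced independently.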

It strikes me that while the same effect may be achievable in Hadoop by using special keys,
this formulation may be more natural. It would better facilitate data operations where several
passes over large data could be condensed into a single map phase with multiple sets of
reducers, resulting in fewer map jobs.
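
For comparison, the "special keys" workaround can be sketched roughly as follows: each key carries a channel tag, and a custom partitioner sends each channel to its own contiguous range of reducers. This is a plain-Java illustration rather than a real org.apache.hadoop Partitioner; the class ChannelPartitioner and its methods are hypothetical, and the only borrowed detail is the non-negative hash trick used by Hadoop's HashPartitioner.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: emulating reducer channels with channel-tagged keys. */
final class ChannelPartitioner {
    // For each channel: {index of its first reducer, number of reducers it owns}.
    private final Map<String, int[]> ranges = new LinkedHashMap<>();
    private int totalReducers = 0;

    /** Reserves the next numReducers reducer slots for the named channel. */
    ChannelPartitioner addChannel(String name, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        if (ranges.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate channel: " + name);
        }
        ranges.put(name, new int[] {totalReducers, numReducers});
        totalReducers += numReducers;
        return this;
    }

    /** Total reducer count the job would have to be configured with. */
    int getTotalReducers() {
        return totalReducers;
    }

    /** Maps a (channel, key) pair to a reducer index inside the channel's range. */
    int getPartition(String channel, String key) {
        int[] range = ranges.get(channel);
        if (range == null) {
            throw new IllegalArgumentException("Unknown channel: " + channel);
        }
        // Mask the sign bit so the modulo result is never negative.
        int hash = key.hashCode() & Integer.MAX_VALUE;
        return range[0] + hash % range[1];
    }
}
```

Even with such a partitioner, the reducer and combiner would still have to branch on the channel tag themselves, which is the awkwardness a first-class channel abstraction would remove.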

(For instance, see Figure 2 of the paper, where there are two channels: one for data, one
for the model.)

I note from the documentation of MultipleOutputs: "When named outputs are used within
a Mapper implementation, key/values written to a name output are not part of the reduce phase,
only key/values written to the job OutputCollector are part of the reduce phase."

The proposed change would address this limitation of MultipleOutputs.

