hadoop-mapreduce-dev mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Of Configurations and Contexts
Date Wed, 10 Feb 2010 22:16:17 GMT
Hi folks,

I've uncovered some behavior in Hadoop that I found surprising. I think this
represents a design flaw that I'd like to see corrected.

As we well know, decoupled components in a MapReduce job communicate
information forward through the use of Configuration instances. Every
Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
Configuration object inside, accessible via getConfiguration().

The semantics of passing data from the "configuration phase" to the "run
phase" are simple: the user creates a Job on the client machine, populates its
Configuration with necessary values, and all those values will be visible in
the JobContext received in the map/reduce tasks themselves. Every task
expects to get the same view of the user-configured values here.

Similarly, in my Mapper, if during the setup() method I call
context.getConfiguration().set("foo","bar"), I expect that
context.getConfiguration().get("foo") returns "bar" during the cleanup()
method. During a map task's execution, the configuration moves "forward
linearly" through time.

The confusing part is that during the initial setup steps of the map task, a
series of different configurations are used. The noteworthy section of code
is MapTask.java in the runNewMapper() method (lines 607--650). A JobContext
is passed in; this is immediately used as the basis for a
TaskAttemptContext. The TAC is then used to initialize the InputFormat and
the RecordReader. The JobContext is then re-used to instantiate a
MapContext. The RecordReader's "initialize" method is then called with this
context, ostensibly to "switch the RR over" to the MapContext. The Mapper
itself is then run with the MapContext. Each of these two new Context
objects makes a deep copy of the Configuration present in JobContext.
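
To make that sequence concrete, here is a minimal sketch of the copy
pattern described above. It does not use Hadoop's classes: SketchConf is a
hypothetical stand-in for Configuration (a string map with a copying
constructor), and the variable names only mirror the contexts in
runNewMapper().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: a string map with a copying constructor.
class SketchConf {
    private final Map<String, String> values = new HashMap<>();

    SketchConf() {}

    // Deep copy: later changes to either object are invisible to the other.
    SketchConf(SketchConf other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

class ContextFlowSketch {
    // Returns what the Mapper-side configuration sees for a key set by the InputFormat.
    static String valueSeenByMapContext() {
        SketchConf jobConf = new SketchConf();
        jobConf.set("user.setting", "from-client");

        // Step 1: the TaskAttemptContext copies the job configuration.
        SketchConf taskAttemptConf = new SketchConf(jobConf);

        // Step 2: the InputFormat records a setting while creating the RecordReader.
        taskAttemptConf.set("foo", "bar");

        // Step 3: the MapContext copies the *job* configuration, not the TAC's copy.
        SketchConf mapConf = new SketchConf(jobConf);

        return mapConf.get("foo");
    }

    public static void main(String[] args) {
        System.out.println("foo seen via MapContext: " + valueSeenByMapContext());
    }
}
```

In this sketch the value set in step 2 is absent from the copy made in
step 3, because both copies branch from the same original.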

The problem here is that if the InputFormat sets any Configuration settings,
the RecordReader will properly receive those during its construction -- but
the same RecordReader may be using a *different* context and thus a
*different* configuration during the actual running of the Mapper itself!
LineRecordReader in particular downcasts its TaskAttemptContext to a
MapContext at some point during its lifetime, assuming that this
initialize() call has been made and that the new context is a MapContext.
This is completely type-unsafe, and prevents LineRecordReader from being
wrapped inside another RecordReader in all cases.

Furthermore, other RecordReader initialize() methods do not do anything;
they continue to use the Context they were created with.

So now Configuration settings set in InputFormat.createRecordReader() may or
may not be present in the Configuration accessible during
RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
semantics (and that of any outer RecordReader wrapping this one!).

This led to a pretty subtle bug in some code I was writing yesterday using
CombineFileInputFormat, which requires that you wrap some RecordReader
instances in others.

So my questions are:
* Is there a solid rationale for isolating the Configuration used in these
various points in time?
* If not, is there a reason to make those deep copies of the Configuration?
Or can they all just share a reference to the same Configuration instance?
* If we really want deep copies, can the MapContext's copy be based off the
TaskAttemptContext's copy, so that we at least have a linear flow of
configuration settings through the execution of MapTask.runNewMapper()?
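
The last two options can be sketched as follows. This is an illustration
under assumptions, not a patch: FlowConf is a hypothetical stand-in for
Configuration, and the two methods only model how a value set on the
TaskAttemptContext's configuration would reach the MapContext's.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration.
class FlowConf {
    private final Map<String, String> values = new HashMap<>();

    FlowConf() {}

    // Deep copy of another configuration.
    FlowConf(FlowConf other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

class LinearFlowSketch {
    // Option A: every context holds a reference to the same instance.
    static String sharedReference() {
        FlowConf jobConf = new FlowConf();
        FlowConf taskAttemptConf = jobConf;   // no copy
        taskAttemptConf.set("foo", "bar");    // set by the InputFormat
        FlowConf mapConf = jobConf;           // same instance again
        return mapConf.get("foo");
    }

    // Option B: keep deep copies, but base the MapContext's copy on the TAC's copy.
    static String chainedCopies() {
        FlowConf jobConf = new FlowConf();
        FlowConf taskAttemptConf = new FlowConf(jobConf);
        taskAttemptConf.set("foo", "bar");    // set by the InputFormat
        FlowConf mapConf = new FlowConf(taskAttemptConf);
        return mapConf.get("foo");
    }

    public static void main(String[] args) {
        System.out.println("shared reference: " + sharedReference());
        System.out.println("chained copies:   " + chainedCopies());
    }
}
```

Either way the setting survives into the Mapper's view; option B
additionally leaves the original job configuration untouched.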

I'm happy to write a patch to make these semantics more clear. As it is, I
think the notion of needing to reinitialize the RecordReader with a
completely different context is error-prone. (CombineFileRecordReader, for
example, in its initialize() method, does not call curReader.initialize() to
initialize its child. This is a separate bug, which I'll post a patch for,
but the design of the context situation makes this more problematic than it
otherwise needs to be.)
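
For illustration, the forwarding behavior a wrapping reader would need
looks roughly like this. The interface and classes are hypothetical and
greatly simplified (a String stands in for the context object); this is
not Hadoop's RecordReader API or the actual CombineFileRecordReader code.

```java
// Hypothetical, simplified reader interface; a String stands in for a context.
interface SketchReader {
    void initialize(String context);
    String currentContext();
}

// Innermost reader: remembers the context it was created with until re-initialized.
class LeafReader implements SketchReader {
    private String context;

    LeafReader(String creationContext) { this.context = creationContext; }

    public void initialize(String context) { this.context = context; }

    public String currentContext() { return context; }
}

// Wrapper that forwards initialize() so the child switches contexts with it.
class ForwardingReader implements SketchReader {
    private final SketchReader child;

    ForwardingReader(SketchReader child) { this.child = child; }

    public void initialize(String context) { child.initialize(context); }

    public String currentContext() { return child.currentContext(); }
}
```

If the wrapper's initialize() were empty, the child would keep using the
context it was constructed with, which is the mismatch described above.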

Does anyone have any input on this situation?
- Aaron Kimball
