hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris K Wensel <ch...@wensel.net>
Subject Re: Of Configurations and Contexts
Date Fri, 12 Feb 2010 16:43:52 GMT
Is this a side effect driven by having all the static accessors speak to specific Context types
instead of Configuration?



On Feb 11, 2010, at 6:13 PM, Aaron Kimball wrote:

> Filed https://issues.apache.org/jira/browse/MAPREDUCE-1486 with a concrete
> test case.
> On Wed, Feb 10, 2010 at 5:17 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>> This patch would allow the configuration to flow through the map task in a
>> linear fashion:
>> diff --git src/java/org/apache/hadoop/mapred/MapTask.java
>> src/java/org/apache/hadoop/mapred/MapTask.
>> index f889a47..4170922 100644
>> --- src/java/org/apache/hadoop/mapred/MapTask.java
>> +++ src/java/org/apache/hadoop/mapred/MapTask.java
>> @@ -637,7 +637,9 @@ class MapTask extends Task {
>>     org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY,
>>     mapContext =
>> -      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
>> getTaskID(),
>> +      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
>> +          taskContext.getConfiguration(),
>> +          getTaskID(),
>>           input, output,
>>           committer,
>>           reporter, split);
>> On Wed, Feb 10, 2010 at 2:33 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>> On Wed, Feb 10, 2010 at 2:25 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>> This is, if I'm understanding Aaron correctly, the same issue that
>>>> makes the mapred.input.file configuration very hard to implement in
>>>> the new API.
>>> Sorry, I should clarify this. Obviously that particular feature can be
>>> accessed by downcasting the input split to FileSplit. But this is very
>>> hard to deal with when the input format wants to use a different
>>> implementation class - you end up coupling the mapper to the
>>> inputformat in a dirty way. Or, if you want to access the input file
>>> name from the OutputFormat, I believe you're entirely out of luck
>>> (though I haven't looked in trunk). In the prior API where
>>> Configuration got passed through the flow nicely, it was trivial to do
>>> this.
>>> -Todd
>>>> -Todd
>>>> On Wed, Feb 10, 2010 at 2:16 PM, Aaron Kimball <aaron@cloudera.com>
>>> wrote:
>>>>> Hi folks,
>>>>> I've uncovered some behavior in Hadoop that I found surprising. I think
>>> this
>>>>> represents a design flaw that I'd like to see corrected.
>>>>> As we well know, decoupled components in a MapReduce job communicate
>>>>> information forward through the use of Configuration instances. Every
>>>>> Context (JobContext, TaskAttemptContext, MapContext, etc) carries a
>>>>> Configuration object inside, accessible via getConfiguration().
>>>>> The semantics of passing data from the "configuration phase" to the
>>> "run
>>>>> phase" is easy; the user creates a Job on the client machine, populates
>>> its
>>>>> Configuration with necessary values, and all those values will be
>>> visible in
>>>>> the JobContext received in the map/reduce tasks themselves. Every task
>>>>> expects to get the same view of the user-configured values here.
>>>>> Similarly, in my Mapper, if during the setup() method I call
>>>>> context.getConfiguration().set("foo","bar"), I expect that
>>>>> context.getConfiguration.get("foo") returns "bar" during the cleanup()
>>>>> method. During a map task's execution, the configuration moves "forward
>>>>> linearly" through time.
>>>>> The confusing part is that during the initial setup steps of the map
>>> task, a
>>>>> series of different configurations are used. The noteworthy section of
>>> code
>>>>> is MapTask.java in the runNewMapper() method (lines 607--650). A
>>> JobContext
>>>>> is passed in; this is immediately used as the basis for a
>>>>> TaskAttemptContext. The TAC is then used to initialize the InputFormat
>>> and
>>>>> the RecordReader. The JobContext is then re-used to instantiate a
>>>>> MapContext. The RecordReader's "initialize" method is then called with
>>> this
>>>>> context, ostensibly to "switch the RR over" to the MapContext. The
>>> Mapper
>>>>> itself is then run with the MapContext. Each of these two new Context
>>>>> objects makes a deep copy of the Configuration present in JobContext.
>>>>> The problem here is that if the InputFormat sets any Configuration
>>> settings,
>>>>> the RecordReader will properly receive those during its construction
>>> but
>>>>> the same RecordReader may be using a *different* context and thus a
>>>>> *different* configuration during the actual running of the Mapper
>>> itself!
>>>>> LineRecordReader in particular downcasts its TaskAttemptContext to a
>>>>> MapContext at some point during its lifetime, assuming that this
>>>>> initialize() call has been made and that the new context is a
>>> MapContext.
>>>>> This is completely type-unsafe, and prevents LineRecordReader from
>>> being
>>>>> wrapped inside another RecordReader in all cases.
>>>>> Furthermore, other RecordReader initialize() methods do not do
>>> anything;
>>>>> they continue to use the Context they were created with.
>>>>> So now Configuration settings set in InputFormat.createRecordReader()
>>> may or
>>>>> may not be present in the Configuration accessible during
>>>>> RecordReader.nextKeyValue() depending on RecordReader.initialize()'s
>>>>> semantics (and that of any outer RecordReader wrapping this one!).
>>>>> This led to a pretty subtle bug in some code I was writing yesterday
>>> using
>>>>> CombineFileInputFormat, which requires that you wrap some RecordReader
>>>>> instances in others.
>>>>> So my questions are:
>>>>> * Is there a solid rationale for isolating the Configuration used in
>>> these
>>>>> various points in time?
>>>>> * If not, is there a reason to make those deep copies of the
>>> Configuration?
>>>>> or can they all just share a reference to the same Configuration
>>> instance?
>>>>> * If we really want deep copies, can the MapContext's copy be based off
>>> the
>>>>> TaskAttemptContext's copy, so that we at least have a linear flow of
>>>>> configuration settings through the execution of MapTask.runNewMapper()?
>>>>> I'm happy to write a patch to make these semantics more clear. As it
>>> is, I
>>>>> think the notion of needing to reinitialize the RecordReader with a
>>>>> completely different context is error-prone. (CombineFileRecordReader,
>>> for
>>>>> example, in its initialize() method, does not call
>>> curReader.initialize() to
>>>>> initialize its child. This is a separate bug, which I'll post a patch
>>> for,
>>>>> but the design of the context situation makes this more problematic
>>> than it
>>>>> otherwise needs to be.)
>>>>> Does anyone have any input on this situation?
>>>>> Thanks,
>>>>> - Aaron Kimball

Chris K Wensel

View raw message