hadoop-mapreduce-dev mailing list archives

From Mahesh Balija <balijamahesh....@gmail.com>
Subject Re: map-reduce-related school project help
Date Mon, 26 Nov 2012 05:46:18 GMT
Hi Randy/Alex,

                Your problem seems interesting; as I understand it, you
want to provide a way for Hadoop to handle small jobs efficiently as well.

                Please see my inline answers,

On Mon, Nov 26, 2012 at 7:08 AM, rshepherd <rjs471@nyu.edu> wrote:

> Hi everybody,
> I am a student at NYU and am evaluating an idea for final project for a
> distributed systems class. The idea is roughly as follows: the overhead
> for running map-reduce on a 'small' job is high. (A small job would be
> defined as something fitting in memory on a single machine.) Can
> hadoop's map-reduce be modified to be efficient for jobs such as this?
> It seems that one way to begin achieving this goal would be to
> modify the way the intermediate key-value pairs are handled, the
> "handoff" from the map to the reduce. Rather than writing them to HDFS,
> either pass them directly to a reducer or keep them in memory in a data
> structure. Using a single, shared hashmap would alleviate the need to
> sort the mapper output. Instead perhaps distribute the slots to a
> reducer or reducers on multiple threads. My hope is that, as this is a
> simplification of distributed map-reduce, it will be relatively
> straightforward to alter the code to an in-memory approach for smaller
> jobs that would perform very well for this special case.
Actually the framework is responsible for invoking the mapper and reducer
and for maintaining the intermediate records on the local file system.
I am not sure how much code you would need to rewrite to handle this case
(maybe the Context which writes the data, the partitioning, invoking the
reducer function for your HashMap entries, etc.).
NOTE: even though your HashMap is small enough to fit in memory, serializing
it to the corresponding reducer will be an overhead if the reducer is not on
the same node (it is better to avoid serializing to a different node).
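The hand-off you describe can be sketched in plain Java (no Hadoop classes;
the class and method names here are invented for illustration): mapper
threads emit into one shared ConcurrentHashMap, so nothing is spilled to
disk and no sort phase is needed before the reduce.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java sketch of an in-memory map-reduce for a job that fits in RAM.
public class InMemoryWordCount {

    public static Map<String, Long> run(List<String> lines) throws InterruptedException {
        // Shared intermediate store replacing the map-side spill files.
        final ConcurrentHashMap<String, AtomicLong> counts =
                new ConcurrentHashMap<String, AtomicLong>();

        // "Map tasks": one thread per input split (here, one per line).
        Thread[] mappers = new Thread[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            final String line = lines.get(i);
            mappers[i] = new Thread(new Runnable() {
                public void run() {
                    for (String word : line.split("\\s+")) {
                        if (word.isEmpty()) continue;
                        AtomicLong c = counts.get(word);
                        if (c == null) {
                            // Race-safe lazy init without locking the whole map.
                            AtomicLong fresh = new AtomicLong();
                            c = counts.putIfAbsent(word, fresh);
                            if (c == null) c = fresh;
                        }
                        c.incrementAndGet(); // the "reduce" is folded into the emit
                    }
                }
            });
            mappers[i].start();
        }
        for (Thread t : mappers) t.join();

        // Collapse the atomics into plain longs for the caller.
        Map<String, Long> result = new HashMap<String, Long>();
        for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue().get());
        }
        return result;
    }
}
```

Because the hash map already groups values by key, no sorted intermediate
files are needed; this only works while everything stays on one node.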

> I was hoping that someone on the list could help me with the following
> questions:
> 1) Does this sound like a good idea that might be achievable in a few
> weeks?
Though the idea is interesting, it may need a lot of effort, as you have to
understand the framework thoroughly, and it may require many code changes.
Along with that, it should be configurable, e.g. a property set on the Job
instance.
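For example (the Hadoop classes below are real, but the property key is
invented purely to illustrate how such a switch could be exposed per job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Hypothetical switch -- this key does not exist in Hadoop today; it only
// shows how the in-memory path could be opted into on a per-job basis.
conf.setBoolean("mapreduce.job.inmemory.enable", true);
Job job = new Job(conf, "small-wordcount");
```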

> 2) Does my intuition about how to achieve the goal seem reasonable?
I am not really sure, as you would need to dig into various components.

> 3) If so, any advice on now to navigate the code base? (Any pointers on
> packages/classes of interest would be highly appreciated)
Context, Partitioner, Mapper, Reducer, Job/JobConf, and the backend
framework classes that invoke them; there may be more that I cannot think
of right now.

> 4) Any other feedback?

Your idea is essentially the opposite of how Hadoop normally operates.
Evaluate some options, like running a job in local-runner mode, and see how
that differs from your idea/approach.
Also, making this efficient across the different cases will be the biggest
concern (e.g. avoiding serializing the map when it is not needed).
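For comparison, the local runner keeps the entire job in a single JVM. In
the Hadoop 1.x configuration of the time, that is selected roughly like
this (a sketch, not a complete job driver):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// With "local" as the job tracker address, Hadoop uses LocalJobRunner:
// map and reduce tasks run inside this one JVM instead of a cluster.
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///"); // read/write the local FS, not HDFS
```

Measuring your in-memory approach against this mode would show how much of
the small-job overhead is the cluster machinery versus the sort/spill path.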

> Thanks in advance to anyone willing and able to help!
> Randy
