hadoop-mapreduce-dev mailing list archives

From "" <sampanri...@gmail.com>
Subject Re: map-reduce-related school project help
Date Mon, 26 Nov 2012 02:54:15 GMT
Hi Randy,
The intermediate key-value pairs are not written to HDFS. They are written to the local file
system of the node running the map task. Besides, if the job is "small", why use MapReduce
at all? You can just run it on a single local machine.

Jiang Shan

From: rshepherd
Date: 2012-11-26 09:38
To: mapreduce-dev
Subject: map-reduce-related school project help
Hi everybody,

I am a student at NYU and am evaluating an idea for a final project for a
distributed systems class. The idea is roughly as follows: the overhead
for running map-reduce on a 'small' job is high. (A small job would be
defined as something fitting in memory on a single machine.) Can
hadoop's map-reduce be modified to be efficient for jobs such as this?

It seems that one way to begin to achieve this goal would be to
modify the way the intermediate key-value pairs are handled, the
"handoff" from the map to the reduce. Rather than writing them to HDFS,
either pass them directly to a reducer or keep them in memory in a data
structure. Using a single, shared hashmap would remove the need to
sort the mapper output. The hashmap's slots could then perhaps be
distributed to a reducer or reducers running on multiple threads. My
hope is that, as this is a simplification of distributed map-reduce, it
will be relatively straightforward to alter the code to an in-memory
approach that would perform very well for the special case of smaller jobs.
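
To make the idea above concrete, here is a rough sketch of the in-memory
handoff in plain Java (8 or later). It does not use or modify any Hadoop
code: the class InMemoryMapReduce, the Mapper interface, and the run
method are all hypothetical names made up for illustration. Mappers emit
pairs into one shared ConcurrentHashMap that groups values by key, so no
sort is needed, and the grouped entries are then reduced on multiple
threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.BiFunction;

// Hypothetical, standalone sketch; none of these names come from Hadoop.
final class InMemoryMapReduce {

    /** Hypothetical mapper: emits zero or more (key, value) pairs per record. */
    interface Mapper<I, K, V> {
        void map(I record, BiConsumer<K, V> emit);
    }

    static <I, K, V, R> Map<K, R> run(List<I> input,
                                      Mapper<I, K, V> mapper,
                                      BiFunction<K, List<V>, R> reducer) {
        // Shared hashmap standing in for the sort/shuffle step:
        // values are grouped by key as soon as they are emitted.
        ConcurrentHashMap<K, List<V>> groups = new ConcurrentHashMap<>();

        // Map phase on multiple threads. compute() is atomic per key,
        // so appending to the per-key list inside it is thread-safe.
        input.parallelStream().forEach(record ->
            mapper.map(record, (k, v) ->
                groups.compute(k, (key, list) -> {
                    if (list == null) {
                        list = new ArrayList<>();
                    }
                    list.add(v);
                    return list;
                })));

        // Reduce phase on multiple threads, one call per distinct key.
        // The reducer must not return null (ConcurrentHashMap rejects nulls).
        ConcurrentHashMap<K, R> results = new ConcurrentHashMap<>();
        groups.entrySet().parallelStream().forEach(e ->
            results.put(e.getKey(), reducer.apply(e.getKey(), e.getValue())));
        return results;
    }

    // Usage example: word count over two small input lines.
    public static void main(String[] args) {
        Map<String, Integer> counts =
            InMemoryMapReduce.<String, String, Integer, Integer>run(
                Arrays.asList("a b a", "b c"),
                (line, emit) -> {
                    for (String word : line.split(" ")) {
                        emit.accept(word, 1);
                    }
                },
                (word, ones) -> ones.size());
        System.out.println(counts);
    }
}
```

This only works while every key's value list fits in memory, which
matches the 'small job' definition above. It also gives up the sorted
key order that a sort-based handoff would provide to the reducers.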

I was hoping that someone on the list could help me with the following:

1) Does this sound like a good idea that might be achievable in a few weeks?
2) Does my intuition about how to achieve the goal seem reasonable?
3) If so, any advice on how to navigate the code base? (Any pointers on
packages/classes of interest would be highly appreciated.)
4) Any other feedback?

Thanks in advance to anyone willing and able to help!