hadoop-mapreduce-dev mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: Research projects for hadoop
Date Fri, 09 Sep 2011 10:43:28 GMT

 As Robert pointed out, performance is a primary criterion — maybe you can come back with
benchmarks? Try sorts with >100 GB of data.

 Also, MRv2 makes it easy to experiment with these ideas; you might want to try that.
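To frame the methodology behind "come back with benchmarks": the real comparison would run cluster-scale sort jobs over >100 GB, but a toy single-JVM harness illustrates the shape of the measurement (generate records, time the sort, report throughput). Everything below is illustrative only — it is not part of Hadoop:

```java
import java.util.Arrays;
import java.util.Random;

public class SortBench {
    // Toy stand-in for a cluster-scale sort benchmark: generate N random
    // 8-byte records, time the sort, and report throughput in MB/s.
    public static double benchmark(int numRecords) {
        long[] records = new Random(42).longs(numRecords).toArray();
        long start = System.nanoTime();
        Arrays.sort(records);
        double seconds = (System.nanoTime() - start) / 1e9;
        // 8 bytes per record
        return (numRecords * 8.0 / (1024 * 1024)) / seconds;
    }

    public static void main(String[] args) {
        double mbPerSec = benchmark(1_000_000);
        System.out.println("sorted 1M records at " + mbPerSec + " MB/s");
    }
}
```

On a real cluster the same idea applies, just with the sort driven as a MapReduce job and throughput computed from the job counters rather than a stopwatch around Arrays.sort.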


On Sep 9, 2011, at 10:34 AM, Saikat Kanjilal wrote:

> How about using VirtualBox and 64-bit CentOS to serve as a Linux container for isolating
map/reduce processes? I have set this up in the past; it's really easy.
>> From: evans@yahoo-inc.com
>> To: mapreduce-dev@hadoop.apache.org
>> Date: Fri, 9 Sep 2011 10:30:37 -0700
>> Subject: Re: Research projects for hadoop
>> The biggest issue with Xen and other virtualization technologies is that there is often
an I/O penalty involved in using them. For many jobs this is not an acceptable trade-off.
I do know, however, that there has been some discussion about using Linux Containers for
isolation of Map/Reduce processes. I don't know if a JIRA has been filed for it or not,
but they are much lighter weight than Xen and other virtualization tech, because all they
are really concerned with is resource isolation, not virtualizing an entire operating system.
>> --Bobby Evans
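For comparison with VM-level isolation, the lightweight end of the spectrum in MapReduce is per-task resource capping: each task runs in its own child JVM launched with a hard heap limit (this is what `mapred.child.java.opts` configures). A minimal sketch of building such a launch command — the task-runner class name here is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ChildJvmCommand {
    // Build the argv for a child task JVM with a hard heap cap, mirroring
    // how mapred.child.java.opts constrains per-task memory without
    // virtualizing an entire operating system.
    static List<String> buildCommand(String mainClass, int heapMb) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + heapMb + "m"); // per-task heap ceiling
        cmd.add(mainClass);             // hypothetical task-runner class
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
                buildCommand("org.example.TaskRunner", 512)));
    }
}
```

Containers (cgroups) generalize this from the heap alone to CPU, memory, and I/O, which is what makes them attractive relative to full virtualization.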
>> On 9/9/11 10:58 AM, "Saikat Kanjilal" <sxk1969@hotmail.com> wrote:
>> Hi folks, I was looking through the following wiki page: http://wiki.apache.org/hadoop/HadoopResearchProjects
>> and was wondering if there's been any work done (or any interest in doing work) on the following:
>>
>> Integration of Virtualization (such as Xen) with Hadoop tools
>> How does one integrate sandboxing of arbitrary user code in C++ and other languages in a
>> VM such as Xen with the Hadoop framework? How does this interact with SGE, Torque, Condor?
>> As each individual machine has more and more cores/CPUs, it makes sense to partition each
>> machine into multiple virtual machines. That gives us a number of benefits:
>> - By assigning a virtual machine to a datanode, we effectively isolate the datanode from
>> the load on the machine caused by other processes, making the datanode more responsive/reliable.
>> - With multiple virtual machines on each machine, we can lower the granularity of HOD
>> scheduling units, making it possible to schedule multiple tasktrackers on the same machine
>> and improving the overall utilization of the whole cluster.
>> - With virtualization, we can easily snapshot a virtual cluster before releasing it, making
>> it possible to re-activate the same cluster in the future and start working from the snapshot.
>>
>> Provisioning of long-running services via HOD
>> Work on a computation model for services on the grid. The model would include:
>> - Various tools for defining clients and servers of the service, with at least C++ and
>> Java instantiations of the abstractions
>> - Logical definitions of how to partition work onto a set of servers, i.e. a generalized
>> shard implementation
>> - A few useful abstractions such as locks (exclusive and RW, fairness), leader election,
>> and transactions
>> - Various communication models for groups of servers belonging to a service, such as
>> broadcast, unicast, etc.
>> - Tools for assuring QoS and reliability, and for managing pools of servers for a service
>> with spares, etc.
>> - Integration with HDFS for persistence, as well as access to local filesystems
>> - Integration with ZooKeeper so that applications can use the namespace
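The "generalized shard implementation" item in the list above is often sketched as a consistent-hash ring with virtual nodes, so that keys spread evenly across servers and only roughly 1/N of keys move when a server joins or leaves. A minimal illustration — the server names and vnode count are made up, and this is not code from any Hadoop project:

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    // Consistent-hash ring mapping keys to servers. Each server owns
    // several "virtual node" positions on the ring to smooth the spread.
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int vnodesPerServer;

    public HashRing(List<String> servers, int vnodesPerServer) {
        this.vnodesPerServer = vnodesPerServer;
        for (String s : servers) addServer(s);
    }

    public void addServer(String server) {
        for (int i = 0; i < vnodesPerServer; i++)
            ring.put((server + "#" + i).hashCode(), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < vnodesPerServer; i++)
            ring.remove((server + "#" + i).hashCode());
    }

    // A key belongs to the first vnode at or after its hash, wrapping
    // around to the start of the ring if necessary.
    public String serverFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return tail.isEmpty() ? ring.get(ring.firstKey())
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        HashRing r = new HashRing(List.of("shard-a", "shard-b", "shard-c"), 64);
        System.out.println("user:1234 -> " + r.serverFor("user:1234"));
    }
}
```

The companion abstractions on the list (locks, leader election) are exactly what the ZooKeeper integration item would supply, via its sequential-ephemeral-node recipes.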
>> I would like to help out with a design for the above or with prototyping code;
>> please let me know what the process would be to move forward with this.
>> Regards
