gora-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: Data locality
Date Thu, 03 Oct 2013 08:32:17 GMT
Hi Renato

Thanks for your comments

> Gora doesn't take into account data locality as this is taken care of by
> each different data store.

The datastores don't know anything about MapReduce nodes. Imagine you have
a Hadoop cluster with 3 nodes, each of which also runs a datastore like
HBase. If Gora does not enforce data locality (as it does with HDFS
content), then Gora would build the MapReduce inputs using data from other
nodes, which means more network traffic. Am I missing something here?
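
To make the concern concrete, here is a toy model (plain Python, not Gora or
Hadoop code). The node names, region names and the schedule() helper are all
made up for illustration. It only shows that a scheduler can keep reads local
if each input split carries a host hint, and cannot if the hint is missing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:
    """One unit of map input. `hosts` is the locality hint (may be empty)."""
    name: str
    hosts: tuple = ()

def schedule(splits, nodes):
    """Place each split on a hinted host if one exists, else on any node.

    Ties are broken by current load, then by node name.
    """
    load = {n: 0 for n in nodes}
    placement = {}
    for s in splits:
        candidates = [h for h in s.hosts if h in load] or list(nodes)
        target = min(candidates, key=lambda n: (load[n], n))
        load[target] += 1
        placement[s.name] = target
    return placement

def remote_reads(placement, actual_location):
    """Count splits whose map task runs on a node that does not hold the data."""
    return sum(1 for name, node in placement.items()
               if actual_location[name] != node)

# Hypothetical 3-node cluster where each node also hosts one datastore region.
nodes = ["node1", "node2", "node3"]
actual_location = {"region-a": "node3", "region-b": "node1", "region-c": "node2"}

# Case 1: the input format passes no location hint to the scheduler.
blind = [Split(name) for name in actual_location]
# Case 2: the input format reports where each region lives.
hinted = [Split(name, (host,)) for name, host in actual_location.items()]

blind_remote = remote_reads(schedule(blind, nodes), actual_location)
hinted_remote = remote_reads(schedule(hinted, nodes), actual_location)
print("remote reads without hints:", blind_remote)
print("remote reads with hints:", hinted_remote)
```

In this contrived layout every read is remote without hints and none is
remote with them. A real cluster would land somewhere in between.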

> The one thing that Gora does take into account is
> the number of partitions it should use, and that number of partitions is
> used to run more/fewer map tasks. This partition number hasn't been
> implemented properly by all data stores, and AFAIK most of them return a
> single partition, which means we use only one Map task to read as much data
> as we have to.

That does not match what we found in
http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html,
where the data from HBase and Cassandra was represented as several
partitions. See:

*The distribution of Mappers and Reducers for each task also stays constant
over all iterations with N2C, while with N2Hbase data seems to be
partitioned differently in each iteration and more Mappers are required as
the crawl goes on. This results in a longer processing time as our Hadoop
setup allows only up to 2 mappers to be used at the same time. Curiously,
this increase in the number of mappers was for the same number of entries
as input.*
*The number of mappers used by N2H and N2C is the main explanation for the
differences between them. To give an example, the generation step in the
first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes.
The latter had its input represented by 3 Mappers whereas the former
required only 2 mappers. The mapping part would have certainly taken a lot
less time if it had been forced into 2 mappers with a larger input (or if
our cluster allowed more than 2 mappers / reducers).*
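
As a rough back-of-the-envelope check of the point about mapper counts, the
sketch below counts how many "waves" of map tasks are needed when only a fixed
number can run at once. It assumes every map task takes about the same time
and ignores startup overhead, so it is not meant to reproduce the 11.6 and 20
minute figures, only to show why a third mapper hurts on a 2-slot setup. The
map_waves() helper is made up for illustration.

```python
import math

def map_waves(num_mappers, max_concurrent):
    """Number of sequential rounds needed to run all map tasks."""
    return math.ceil(num_mappers / max_concurrent)

# Our setup allowed at most 2 mappers at the same time.
slots = 2
waves_n2c = map_waves(2, slots)  # N2C: input represented by 2 mappers
waves_n2h = map_waves(3, slots)  # N2H: input represented by 3 mappers
print("N2C waves:", waves_n2c)
print("N2H waves:", waves_n2h)
```

With 2 slots, the third mapper has to wait for a free slot, so the map phase
needs a second round even though the total input is the same size.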

> Planning to work on this coming summer (SouthAmerican summer) ;)




> Renato M.
> 2013/10/2 Julien Nioche <lists.digitalpebble@gmail.com>
> > Hi guys,
> >
> > I can't quite remember whether Gora takes data locality into account when
> > generating the input for a map reduce job. Could someone explain how it is
> > currently handled and, if things differ from one backend to the other,
> > how?
> >
> > Thanks
> >
> > Julien
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >

Open Source Solutions for Text Engineering

