gora-dev mailing list archives

From Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>
Subject Re: Data locality
Date Thu, 03 Oct 2013 14:46:32 GMT
Hi Julien,

My comments are inline.

2013/10/3 Julien Nioche <lists.digitalpebble@gmail.com>

> Hi Renato
>
> Thanks for your comments
>
> Gora doesn't take into account data locality as this is taken care of by
> > each different data store.
>
>
> The datastores don't know anything about MapReduce nodes.


True.


> Imagine you have
> a Hadoop cluster with 3 nodes and that each node also runs a datastore like
> HBase, if Gora does not enforce data locality (as it does with HDFS
> content), then Gora would build the mapreduce inputs using data from other
> nodes meaning more network traffic. Am I missing something here?
>

Gora itself does not know anything about data locality; it relies on how the
InputFormat is created for each data store, so it is each data store's
InputFormat that passes this information on to MapReduce. Gora acts as a
connector to the different data stores.
In the case of HBase, the input splits do come directly from HBase, so they
tell the MapReduce job how to connect to HBase in a locality-aware way.
Now let's say we want to use Gora to connect to a standalone SQL database;
the splits would then probably be created by getting the total number of
records to be queried and creating a number of mappers to retrieve that
total in parallel. Data locality would be passed to Gora through the
InputFormat for that SQL database.
So, in a way, Gora "is aware" through the InputFormats implemented
specifically for each data store, but Gora does not implement this feature
by itself; it relies on the data stores' input formats.
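To make the SQL case concrete, here is a minimal sketch of how such splits
could be computed. The class and method names are hypothetical (not part of
Gora or Hadoop): the total record count is divided into contiguous
offset/limit ranges, one per mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive offset/limit split ranges for a standalone
// SQL database, one range per mapper, from a total record count.
public class SqlSplitSketch {

    // A minimal stand-in for a Hadoop InputSplit carrying a row range.
    static final class RowRangeSplit {
        final long offset;
        final long limit;
        RowRangeSplit(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    // Divide totalRecords into numMappers contiguous ranges; the last
    // split absorbs the remainder so no rows are dropped.
    static List<RowRangeSplit> computeSplits(long totalRecords, int numMappers) {
        List<RowRangeSplit> splits = new ArrayList<>();
        long perSplit = totalRecords / numMappers;
        for (int i = 0; i < numMappers; i++) {
            long offset = i * perSplit;
            long limit = (i == numMappers - 1) ? totalRecords - offset : perSplit;
            splits.add(new RowRangeSplit(offset, limit));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 records across 3 mappers -> ranges [0,3), [3,6), [6,10)
        for (RowRangeSplit s : computeSplits(10, 3)) {
            System.out.println(s.offset + "," + s.limit);
        }
    }
}
```

Each range would map to a `SELECT ... LIMIT/OFFSET`-style query executed by
one mapper; since a standalone database runs on a single node, these splits
carry no locality hints, which is exactly the contrast with the HBase case.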


> > The one thing that Gora "take" into account is
> > the number of partitions it should use, and that number of partitions are
> > used to run more/less map tasks. This partition  number hasn't been
> > implemented by all data stores properly and AFAIK most of them return a
> > single partition, which means we only use a Map task to read as much data
> > as we have to.
> >
>
> That does not match what we found in
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html as
> the data from HBase and Cassandra was represented as several partitions.
> see
>
> *The distribution of Mappers and Reducers for each task also stays constant
> over all iterations with N2C, while with N2Hbase data seems to be
> partitioned differently in each iteration and more Mappers are required  as
> the crawl goes on. This results in a longer processing time as our Hadoop
> setup allows only up to 2 mappers to be used at the same time. Curiously,
> this increase in the number of mappers was for the same number of entries
> as input. *
>
> *The number of mappers used by N2H and N2C is the main explanation for the
> differences between them. To give an example, the generation step in the
> first iteration took 11.6 minutes with N2C whereas N2H required 20 minutes.
> The latter had its input represented by 3 Mappers whereas the former
> required only 2 mappers. The mapping part would have certainly taken a lot
> less time if it had been forced into 2 mappers with a larger input (or if
> our cluster allowed more than 2 mappers / reducers) .*
>

Yeah, for sure this is interesting, but I think that has to do with the
number of mappers you configured on your cluster (I think it comes
configured to use 2 mappers by default).
I am saying this because, if you look at [1], by default we create a single
partition for Cassandra, so you might even be reading that data twice,
whereas for HBase [2] the accurate number of partitions is created.
Hope it makes sense.
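The contrast between [1] and [2] can be sketched like this (illustrative
names only, not the actual Gora API): a store that always returns a single
partition forces MapReduce to schedule one map task no matter how much data
there is, while a store that returns one partition per region lets the
framework run locality-aware mappers in parallel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch contrasting the two partitioning behaviours
// discussed above; names are illustrative, not the actual Gora API.
public class PartitionSketch {

    // Cassandra-style (at the time): the whole query comes back as a
    // single partition, so MapReduce can schedule only one map task.
    static List<String> singlePartition(String query) {
        return Collections.singletonList(query);
    }

    // HBase-style: one partition per region, each carrying the region's
    // start key, so map tasks can run in parallel and data-locally.
    static List<String> perRegionPartitions(String query, String[] regionStartKeys) {
        List<String> partitions = new ArrayList<>();
        for (String startKey : regionStartKeys) {
            partitions.add(query + "@" + startKey);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // One partition -> one mapper, regardless of table size.
        System.out.println(singlePartition("scan-all").size());
        // Three regions -> three partitions -> up to three parallel mappers.
        System.out.println(perRegionPartitions("scan-all",
                new String[] {"a", "h", "q"}).size());
    }
}
```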


Renato M.

[1]
https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L291
[2]
https://github.com/apache/gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L388

> Planning to work on this coming summer (South American summer) ;)
> >
>
> Great!
>
> Thanks
>
> Julien
>
>
>
> >
> >
> > Renato M.
> >
> >
> > 2013/10/2 Julien Nioche <lists.digitalpebble@gmail.com>
> >
> > > Hi guys,
> > >
> > > I can't quite remember whether Gora takes data locality into account
> > > when generating the input for a map reduce job. Could someone explain
> > > how it is currently handled and, if things differ from one backend to
> > > the other, how?
> > >
> > > Thanks
> > >
> > > Julien
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
