gora-dev mailing list archives

From Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>
Subject Re: Data locality
Date Tue, 08 Oct 2013 02:19:27 GMT
Hi Julien,

My comments are inline as well.

2013/10/3 Julien Nioche <lists.digitalpebble@gmail.com>

> Hi Renato,
>
> Comments below (in orange)
>
> On 3 October 2013 15:46, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
> > Hi Julien,
> >
> > My comments are inline.
> >
> > 2013/10/3 Julien Nioche <lists.digitalpebble@gmail.com>
> >
> > > Hi Renato
> > >
> > > Thanks for your comments
> > >
> > > > Gora doesn't take into account data locality as this is taken care of
> > > > by each different data store.
> > >
> > >
> > > The datastores don't know anything about MapReduce nodes.
> >
> >
> > True.
> >
> >
> > > Imagine you have a Hadoop cluster with 3 nodes and each node also runs a
> > > datastore like HBase. If Gora does not enforce data locality (as it does
> > > with HDFS content), then Gora would build the mapreduce inputs using data
> > > from other nodes, meaning more network traffic. Am I missing something
> > > here?
> > >
> >
> > Gora still doesn't know anything about data locality; it relies on how the
> > InputFormat is created for each data store, and in that way it is the
> > InputFormat for each data store that is in charge of passing this
> > information into Gora. Gora acts as a connector to different data stores.
> >
> > In the case of HBase, the input splits created do come directly from HBase,
> > so that tells the MapReduce job how to connect (being aware of data
> > locality) to HBase.
>
>
> To reformulate: the GoraInputFormat builds the splits (which hold the
> information about where the data is located) based on what the method
> getPartitions(Query q) returns for each datastore. A datastore may or may
> not provide information about where the data is located. HBase does,
> Cassandra doesn't (see below). Correct?
>

Correct, Julien. Each data store implementation has its own advantages and
disadvantages, since some data stores have more people involved than others,
so there might be more development in one specific data store than in another.
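
To make that flow concrete, here is a minimal sketch (plain Java with
simplified stand-in types of my own, not the real Gora classes) of how an
input format can turn whatever getPartitions returns into MapReduce splits
that carry location hints:

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a partition: a slice of the data plus optional
// hints about which hosts hold that slice.
interface Partition {
    String[] locations();   // empty array => no locality information
}

// Simplified stand-in for a data store; only the partitioning call matters here.
interface Store {
    List<Partition> getPartitions();
}

// Minimal "split" carrying the location hints the MapReduce scheduler would use.
class Split {
    final String[] hosts;
    Split(String[] hosts) { this.hosts = hosts; }
}

public class InputFormatSketch {
    // The input format simply forwards whatever the store reports: if the
    // store returns one partition with no locations, there is nothing for
    // MapReduce to schedule locally.
    static List<Split> getSplits(Store store) {
        List<Split> splits = new ArrayList<Split>();
        for (Partition p : store.getPartitions()) {
            splits.add(new Split(p.locations()));
        }
        return splits;
    }
}

So data locality is only as good as the locations each backend attaches to
its partitions.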


> > Now let's say we want to use Gora to connect to a standalone SQL
> > database; then the creation of the splits should probably be done by
> > getting the total number of records to be queried, and then creating a
> > number of Mappers to retrieve that total number of records in parallel.
> > Data locality will be passed to Gora through the InputFormat for that SQL
> > database.
> >
>
> Data locality != data location. The gora-SQL module will indeed pass on the
> information about where the data are located (data location). However,
> whether there will be data locality (i.e. code running where the data is
> stored) depends on the getPartitions method. In the case of a standalone
> SQL DB, by definition there won't be data locality.
>

Well, if your SQL DB is running on the same node as some of your MapReduce
tasks, then there would be "data locality". But of course, as you said, that
might not be the case most of the time.


> > So, in a way Gora "is aware" through the InputFormats implemented
> > specifically for each data store, but Gora does not implement this feature
> > by itself; it relies on the data stores' input formats.
> >
>
> or rather how getPartitions is implemented?
>

The getPartitions method uses each data store's InputFormat ;)
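
For example, here is a hedged sketch of my own (roughly what [2] below boils
down to, written against the HBase 0.94-era client API; the surrounding class
and method names are mine, not Gora's): one partition per HBase region, each
tagged with the host currently serving that region.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Pair;

public class RegionPartitionsSketch {

    /** One partition per region, tagged with the host serving that region. */
    public static class RegionPartition {
        public final byte[] startKey;
        public final byte[] endKey;
        public final String location;

        RegionPartition(byte[] startKey, byte[] endKey, String location) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.location = location;
        }
    }

    public static List<RegionPartition> partitionsFor(HTable table) throws IOException {
        List<RegionPartition> partitions = new ArrayList<RegionPartition>();
        // Region boundaries of the table: parallel arrays of start and end keys.
        Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
        byte[][] startKeys = keys.getFirst();
        byte[][] endKeys = keys.getSecond();
        for (int i = 0; i < startKeys.length; i++) {
            // Ask HBase which region server currently hosts this region; that
            // hostname becomes the locality hint attached to the partition.
            HRegionLocation loc = table.getRegionLocation(startKeys[i]);
            partitions.add(new RegionPartition(startKeys[i], endKeys[i], loc.getHostname()));
        }
        return partitions;
    }
}

Those hostnames are exactly the kind of location hints that end up on the
input splits.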


> > > > The one thing that Gora does "take" into account is the number of
> > > > partitions it should use, and that number of partitions is used to run
> > > > more/fewer map tasks. This partition number hasn't been implemented
> > > > properly by all data stores, and AFAIK most of them return a single
> > > > partition, which means we only use one Map task to read all the data
> > > > we have to.
> > > >
> > >
> > > That does not match what we found in
> > > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
> > > as the data from HBase and Cassandra was represented as several
> > > partitions. See
> > >
> > > *The distribution of Mappers and Reducers for each task also stays
> > > constant over all iterations with N2C, while with N2Hbase data seems to
> > > be partitioned differently in each iteration and more Mappers are
> > > required as the crawl goes on. This results in a longer processing time
> > > as our Hadoop setup allows only up to 2 mappers to be used at the same
> > > time. Curiously, this increase in the number of mappers was for the same
> > > number of entries as input.*
> > >
> > > *The number of mappers used by N2H and N2C is the main explanation for
> > > the differences between them. To give an example, the generation step in
> > > the first iteration took 11.6 minutes with N2C whereas N2H required 20
> > > minutes. The latter had its input represented by 3 Mappers whereas the
> > > former required only 2 mappers. The mapping part would have certainly
> > > taken a lot less time if it had been forced into 2 mappers with a larger
> > > input (or if our cluster allowed more than 2 mappers / reducers).*
> > >
> >
> > Yeah, for sure this is interesting, but I think that has to do with the
> > number of Mappers you configured on your cluster (I think that by default
> > it comes configured to use 2 mappers).
>
> > I am saying this because if you see [1], by default we are creating a
> > single partition for Cassandra, so you might even be reading that data
> > twice, whereas in HBase [2] the accurate number of partitions is created.
> >
>
> So if I understand correctly:
> (a) there is no gain in using Cassandra in distributed mode as it will
> represent everything as a single partition
>

Sadly but true.
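
To illustrate what "a single partition" means in practice (my own simplified
sketch, not a copy of the code in [1] below): the whole query comes back as
one slice with no location hints, so a single map task ends up reading all
the data.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SinglePartitionSketch {

    /** A partition is just a key range plus locality hints (empty here). */
    public static class Partition {
        public final Object startKey;   // null => unbounded
        public final Object endKey;     // null => unbounded
        public final List<String> locations;

        Partition(Object startKey, Object endKey, List<String> locations) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.locations = locations;
        }
    }

    // Whatever the query asks for, hand back one partition covering everything,
    // with no information about which Cassandra nodes actually hold the rows.
    public static List<Partition> getPartitions() {
        List<Partition> partitions = new ArrayList<Partition>();
        partitions.add(new Partition(null, null, Collections.<String>emptyList()));
        return partitions;
    }
}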


> (b) in that case why are we getting more than one mapper working? Shouldn't
> there be just a single one?
>

The number of Mappers you get is determined by how many Mappers your cluster
can run. I was running simple MapReduce jobs on my local computer with default
settings and I always got two mappers; then I changed mapred.max.split.size
to bring in smaller chunks of data and got a single one. We should find a way
to set this parameter automatically to improve Gora's data access.
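
For reference, this is how that parameter can be set on the job configuration
(mapred.max.split.size is the pre-YARN property name I used above; whether it
has any effect depends on the InputFormat honouring it):

import org.apache.hadoop.conf.Configuration;

public class SplitSizeConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cap each input split at 64 MB; how the input is then chunked into
        // splits (and therefore map tasks) is up to the InputFormat in use.
        conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
        System.out.println(conf.get("mapred.max.split.size"));
    }
}

It can also be passed on the command line with -D mapred.max.split.size=...
when the job is launched through ToolRunner/GenericOptionsParser.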


> (c) if there is a single partition then we can't enforce data locality
>

True, for data stores which are not using their own input formats inside the
getPartitions method.


> (d) shouldn't the HBase implementation return a constant number of
> partitions given an input of the same size?


Yes, it should. Isn't this the case?


> > Hope it makes sense.
> >
>
> Well, we are definitely making progress on the terms used at least.
>
> Thanks for having taken the time to explain, am sure this thread will be
> useful to others
>

Of course it will, mate.


Renato M.


> Julien
>
>
> >
> >
> > Renato M.
> >
> > [1]
> >
> >
> https://github.com/apache/gora/blob/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L291
> > [2]
> >
> >
> https://github.com/apache/gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L388
> >
> > > > Planning to work on this coming summer (South American summer) ;)
> > > >
> > >
> > > Great!
> > >
> > > Thanks
> > >
> > > Julien
> > >
> > >
> > >
> > > >
> > > >
> > > > Renato M.
> > > >
> > > >
> > > > 2013/10/2 Julien Nioche <lists.digitalpebble@gmail.com>
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I can't quite remember whether Gora takes data locality into account
> > > > > when generating the input for a map reduce job. Could someone explain
> > > > > how it is currently handled, and if things differ from one backend to
> > > > > the other, then how?
> > > > >
> > > > > Thanks
> > > > >
> > > > > Julien
> > > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
