gora-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Fwd: 2.x vs. 1.x speed
Date Wed, 18 Sep 2013 09:47:48 GMT
Including dev@gora.apache.org as not all of you are on the Nutch lists ;-)

Julien

---------- Forwarded message ----------
From: Julien Nioche <lists.digitalpebble@gmail.com>
Date: 16 September 2013 17:43
Subject: Re: 2.x vs. 1.x speed
To: "user@nutch.apache.org" <user@nutch.apache.org>, "dev@nutch.apache.org"
<dev@nutch.apache.org>
Cc: Otis Gospodnetic <otis_gospodnetic@yahoo.com>


Guys,

Following the discussion we had some time ago about comparing 1.x with 2.x,
we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Feel free to comment.

Best,

Julien


On 24 August 2013 05:51, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

> I am sure that Renato (if he is watching) can chime in as well.
> In Gora, for stores that are Hadoop-native in every sense of the word, such
> as Avro, HBase and Accumulo, when we execute a query with GoraInputFormat
> we retrieve GoraInputSplits natively via getPartitions, which means splits
> are obtained for MapReduce jobs, such as many of the jobs we run in Nutch.
> On the other hand, stores such as Cassandra and web service stores such as
> DynamoDB do not (currently) support Hadoop out of the box (the former we
> are working on and hope to have implemented in Gora soon), so it is not
> as simple to get partitions in the same way we would in a Hadoop-native
> store. We therefore obtain a single partition to be used as an
> InputSplit for the MR job. This is certainly an area of concern and right
> now a bottleneck for some operations. We continue to work on this.
>
>
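To make the point about partitions concrete, here is a rough sketch (plain Python, not Gora code) of why a single partition becomes a bottleneck. The names `Partition`, `partitions_for_native_store` and `partitions_for_other_store` are made up for illustration and do not correspond to anything in the Gora API; the only fact relied on is that Hadoop runs one map task per input split.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    """Hypothetical key range; None means unbounded on that side."""
    start_key: Optional[str]
    end_key: Optional[str]


def partitions_for_native_store(split_keys: List[str]) -> List[Partition]:
    """Toy model of a store that can report its own key ranges:
    N split keys yield N + 1 contiguous partitions."""
    bounds = [None, *split_keys, None]
    return [Partition(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


def partitions_for_other_store() -> List[Partition]:
    """Toy model of a store without partition support: one partition
    covering the whole key space."""
    return [Partition(None, None)]


def max_parallel_map_tasks(partitions: List[Partition]) -> int:
    """One map task per input split, so the number of partitions caps
    how many mappers can read the table at the same time."""
    return len(partitions)


native = partitions_for_native_store(["g", "n", "t"])
single = partitions_for_other_store()
print(max_parallel_map_tasks(native))  # 4
print(max_parallel_map_tasks(single))  # 1
```

With a single partition the whole table is read by one mapper, however large the cluster is.
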
> On Wednesday, August 7, 2013, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
> > Hi Otis
> >
> > Definitely *not* the fetching speed. Actually, everything *but* the
> > fetching speed. The fetcher is pretty much the same as 1.x and anyway the
> > performance with fetching is pretty much always limited by the politeness
> > settings, not the implementation.
> >
> > Re-backend: some backend implementations are more mature than others. The
> > one for HBase is probably the one most widely used, the Cassandra one has
> > been greatly improved, in particular performance-wise, the SQL one is
> > broken, etc. We need to measure this, as this is just a gut feeling at this
> > stage.
> >
> > Now for what is slower and why: again, this has to be measured, but I expect
> > 2.x to be slower partly because of [1], i.e. the filtering of entries is
> > not done by the backends (some might provide a way of doing it) but
> > on the client side, when we create the input for mapred. In other
> > words, we pull things from the backend just to discard them. Since 2.x does
> > not have segments like 1.x (which the fetch + parse mapreduce jobs take as
> > a single input), we scan the whole table even if we want to fetch or parse a
> > handful of entries.
> >
> > On the other hand, 2.x specifies what columns to retrieve for a given job,
> > whereas 1.x will, for instance, deserialize the CrawlDatum entirely. The
> > metadata objects are costly to read/write, so 2.x might have the upper hand
> > from that point of view since it pulls and deserializes only what it needs.
> >
> > Finally the most costly steps in a large crawl in 1.x are the generation
> > and update as we have to read/write the crawldb entirely. The way the
> > updates are done in 2.x is different and should be a lot faster.
> >
> > Please could anyone correct me if I am wrong. Some of this is based on my
> > understanding of 2.x, which dates back quite a while, and some of the
> > stuff might have changed in the meantime. The performance would probably
> > vary a lot based on the fine tuning of each backend implementation but
> > having some basic comparison would confirm some of the assertions above.
> >
> > Julien
> >
> >
> > [1] https://issues.apache.org/jira/browse/GORA-119
> >
> >
> >> Julien, could you please elaborate a bit about your comment about speed
> >> depending on the backend used?
> >>
> >> Yes, you were the person I was referring to :)
> >>
> >> Oh, and I *believe* you said it was the fetching speed that was different
> >> between 1.x and 2.x.  Is that right?  Or is some other phase slower in 2.x?
> >>
> >> Thanks,
> >> Otis
> >> ----
> >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> >> http://sematext.com/spm
> >>
> >>
> >>
> >>
> >> >________________________________
> >> > From: Julien Nioche <lists.digitalpebble@gmail.com>
> >> >To: "user@nutch.apache.org" <user@nutch.apache.org>
> >> >Sent: Tuesday, August 6, 2013 10:54 AM
> >> >Subject: Re: 2.x vs. 1.x speed
> >> >
> >> >
> >> >Hi Otis,
> >> >
> >> >That certainly depends on the backend used but on the whole it wouldn't be
> >> >surprising. Would be good to have some data to substantiate it. I am
> >> >planning to put my intern on the case and have some basic comparison as
> >> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
> >> >wants to do it please go ahead.
> >> >
> >> >In case I happen to be the person who told you that Otis, well at least I
> >> >am consistent ;-)
> >> >
> >> >Julien
> >> >
> >> >
> >> >On 6 August 2013 09:08, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> At some point earlier this year I spoke to a person who told me 2.x is
> >> >> (a little?) slower than 1.x.  Is that still the case?
> >> >>
> >> >> Thanks,
> >> >> Otis
> >> >> --
> >> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> >> Performance Monitoring -- http://sematext.com/spm
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >*
> >> >*Open Source Solutions for Text Engineering
> >> >
> >> >http://digitalpebble.blogspot.com/
> >> >http://www.digitalpebble.com
> >> >http://twitter.com/digitalpebble
> >> >
> >> >
> >> >
> >>
> >
> >
> >
>
> --
> *Lewis*
>
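
As a rough illustration of the two effects described in the quoted 7 August message (filtering done on the client after a full scan, and retrieving only the columns a job needs), here is a small sketch in plain Python. It is not Nutch or Gora code: the table layout, the field names and the helpers `make_table`, `scan` and `client_side_select` are all invented for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def make_table(n: int) -> List[Row]:
    """Build a made-up table in which every 100th row is still unfetched."""
    return [
        {
            "key": f"url{i}",
            "status": "unfetched" if i % 100 == 0 else "fetched",
            "content": "<html>...</html>",
            "metadata": {"depth": i % 5},
        }
        for i in range(n)
    ]


def scan(table: Iterable[Row], fields: List[str]) -> Iterator[Row]:
    """Full scan that returns only the requested fields (column projection)."""
    for row in table:
        yield {f: row[f] for f in fields}


def client_side_select(
    table: Iterable[Row], fields: List[str], keep: Callable[[Row], bool]
) -> Tuple[List[Row], int]:
    """Pull every row from the backend, then filter on the client.
    Returns the kept rows and the number of rows that were pulled."""
    pulled = 0
    kept: List[Row] = []
    for row in scan(table, fields):
        pulled += 1
        if keep(row):
            kept.append(row)
    return kept, pulled


table = make_table(1000)
kept, pulled = client_side_select(
    table, ["key", "status"], lambda r: r["status"] == "unfetched"
)
print(len(kept), pulled)  # 10 1000
```

All 1000 rows cross the wire to keep 10 of them, which is the cost attributed to [1]; on the other side, each pulled row carries only `key` and `status`, not `content` or `metadata`, which is the projection advantage.
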



-- 
*
*
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble



