nutch-dev mailing list archives

From Bradford Stephens <bradfordsteph...@gmail.com>
Subject Re: The Future of Nutch, reactivated
Date Tue, 19 May 2009 23:10:35 GMT
I would like to point out that Nutch is going to be essential to our
company's infrastructure; we're definitely case #1. We'll probably have it
running on 100 boxes in a few weeks.

On Tue, May 19, 2009 at 2:26 PM, Mark Olson <Mark.Olson@quantum.com> wrote:

>  R
>
> ----- Original Message -----
> From: Aaron Binns <aaron@archive.org>
> To: nutch-dev@lucene.apache.org <nutch-dev@lucene.apache.org>
> Sent: Tue May 19 13:23:37 2009
> Subject: Re: The Future of Nutch, reactivated
>
>
> Andrzej Bialecki <ab@getopt.org> writes:
>
> >> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
> >> indexing massive data sets, being able to fire up 60+ nodes in a
> >> Hadoop system helps tremendously.
> >
> > Are you familiar with the distributed indexing package in Hadoop
> > contrib/ ?
>
> Only superficially at most.  Last I looked at it, it seemed to be a
> "hello world" prototype.  If it's developed more, it might be worth
> another look.
>
> >> However, one of the biggest challenges to using Nutch is the fact
> >> that the URL is used as the unique key for a document.
> >
> > Indeed, this change is something that I've been considering, too -
> > URL==page doesn't work that well in case of archives, but also when
> > your unit of information is smaller (pagelet) or larger (compound
> > docs) than a page.
> >
> > People can help with this by working on a patch that replaces this
> > silent assumption with an explicit API, i.e. splitting recordId and
> > URL into separate fields.
>
> Patches always welcomed, it is an open source package after all :) I'll
> see about creating a patch-set for the changes I've made in NutchWAX.
>
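
A minimal sketch of what splitting recordId and URL into separate fields could look like, in Java since Nutch is a Java codebase. Every name here (DocRecord, RecordStore, archiveId) is hypothetical and not taken from Nutch or NutchWAX, and the URL-plus-capture-timestamp key is only an assumed scheme for the archive case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; not part of Nutch or NutchWAX.
final class DocRecord {
    final String recordId; // unique key of the stored unit
    final String url;      // where the unit was fetched from; may repeat
    final String content;

    DocRecord(String recordId, String url, String content) {
        this.recordId = recordId;
        this.url = url;
        this.content = content;
    }
}

final class RecordStore {
    // Records are keyed by recordId, never by URL.
    private final Map<String, DocRecord> byId = new LinkedHashMap<>();

    // Assumed key scheme for archives: URL plus capture timestamp.
    static String archiveId(String url, String captureTimestamp) {
        return url + "@" + captureTimestamp;
    }

    void put(DocRecord record) {
        byId.put(record.recordId, record);
    }

    DocRecord get(String recordId) {
        return byId.get(recordId);
    }

    // The URL stays available as an ordinary, non-unique field.
    List<DocRecord> findByUrl(String url) {
        List<DocRecord> matches = new ArrayList<>();
        for (DocRecord record : byId.values()) {
            if (record.url.equals(url)) {
                matches.add(record);
            }
        }
        return matches;
    }

    int size() {
        return byId.size();
    }
}
```

With a key like this, two captures of the same URL are stored side by side instead of the later one overwriting the earlier one. A pagelet or compound document could use some other recordId scheme without touching the url field.
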
> >> As for the future of Nutch, I am concerned over what I see to be an
> >> increasing focus on crawling and fetching.  We have only lightly
> >> evaluated other Open Source search projects, such as Solr, and are not
> >> convinced any can be a drop-in replacement for Nutch.  It looks like
> >> Solr has some nice features for certain; I'm just not convinced it can
> >> scale up to the billion document level.
> >
> > What do you see as the unique strength of Nutch, then? IMHO there are
> > existing frameworks for distributed indexing (on Hadoop) and
> > distributed search (e.g. Katta). We would like to avoid the
> > duplication of effort, and to focus instead on the aspects of Nutch
> > functionality that are not available elsewhere.
>
> Right now, the unique strength of Nutch -- to my organization -- is that
> it has all the requisite pieces and comes closer to a complete solution
> than other open source projects.  What features it lacks compared to
> others are less important than the ones it has that others do not.
>
> Two key features of Nutch indexing are the content parsing and the link
> extraction.  The parsing plugins seem to work well enough, although
> easier modification of content tokenizing and stop-list management would
> be nice.  For example, using a config file to tweak the tokenizing for,
> say, French or Spanish would be nicer than having to write a new .jj file
> and a custom build.
>
> Along the same lines, language-awareness would have to be included in
> the query processing as well.  And speaking of which, the way in which
> Nutch query processing is optimized for web search makes sense.  I've
> read that Solr can be configured to emulate the Nutch query processing.
> If so, it would eliminate a competitive advantage of Nutch.
>
> Nutch's summary/snippet generation approach works fine.  It's not clear
> to me how this is done with the other tools.
>
> On the search service side of things, Nutch is adequate, but I would
> like to investigate other distributed search systems.  My main complaint
> about Nutch's implementation is the use of the Hadoop RPC mechanism.
> It's very difficult to diagnose and debug problems.  I'd prefer if the
> master just talked to the slaves over OpenSearch or a simple HTTP/JSON
> interface.  This way, monitoring tools could easily ping the slaves and
> check for sensible results.
>
> Along the same diagnosis/debug lines, I've added more log messages to
> the start-up code of the search slave.  Without these, it's very
> difficult to diagnose some trivial mistake in the deployment of the
> index/segment shards, such as a mis-named directory or the like.
>
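
The kind of start-up check described above could be sketched roughly as follows. This is an illustration only, not the logging Aaron added: ShardLayoutCheck and findProblems are hypothetical names, and the list of required directory names is supplied by the caller rather than taken from any real Nutch layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch.
final class ShardLayoutCheck {

    // Returns one human-readable message per problem found, so the caller
    // can log them all at start-up instead of failing later on a query.
    static List<String> findProblems(Path shardRoot, List<String> requiredDirs) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(shardRoot)) {
            problems.add("shard root is not a directory: " + shardRoot);
            return problems;
        }
        for (String name : requiredDirs) {
            Path dir = shardRoot.resolve(name);
            if (!Files.isDirectory(dir)) {
                // Catches a mis-named or missing directory.
                problems.add("missing directory: " + dir);
            }
        }
        return problems;
    }
}
```

A search slave could call findProblems once before opening its index and log each returned message, which would make a mis-named directory visible immediately rather than through an opaque RPC failure.
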
> Lastly, there's also the fact that Nutch is a known quantity and we've
> already put non-trivial effort into using and adapting it to our needs.
> It would be difficult to start all over again with another toolset, or
> assemblage of tools.  We also have scaling expectations based on what
> we've achieved so far with Nutch(WAX).  It would be painful to invest
> the time and effort in, say, Solr only to discover it can't scale to the
> same size with the same hardware.
>
>
> Right now, the most interesting other project for us to consider is
> Solr.  There seems to be more and more momentum behind it and it does
> have some neat features, such as the "did you mean?" suggestions and
> things.  However, the distributed search functionality is pretty
> rudimentary IMO and I am concerned about reports that it doesn't scale
> beyond a few million or tens of millions of documents, although it
> appears that some of this has to do with the modify/update capabilities,
> which are mitigated by the use of read-only IndexReaders (or something
> like that).
>
>
> Aaron
>
> --
> Aaron Binns
> Senior Software Engineer, Web Group
> Internet Archive
> aaron@archive.org
>  ------------------------------
> The information contained in this transmission may be confidential. Any
> disclosure, copying, or further distribution of confidential information is
> not permitted unless such privilege is explicitly granted in writing by
> Quantum Corporation. Furthermore, Quantum Corporation is not responsible for
> the proper and complete transmission of the substance of this communication
> or for any delay in its receipt.
>
