nutch-dev mailing list archives

From "Mark Olson" <Mark.Ol...@quantum.com>
Subject Re: The Future of Nutch, reactivated
Date Tue, 19 May 2009 21:24:43 GMT

----- Original Message -----
From: Aaron Binns <aaron@archive.org>
To: nutch-dev@lucene.apache.org <nutch-dev@lucene.apache.org>
Sent: Tue May 19 13:23:37 2009
Subject: Re: The Future of Nutch, reactivated


Andrzej Bialecki <ab@getopt.org> writes:

>> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
>> indexing massive data sets, being able to fire up 60+ nodes in a
>> Hadoop system helps tremendously.
>
> Are you familiar with the distributed indexing package in Hadoop
> contrib/ ?

Only superficially at most.  Last I looked at it, it seemed to be a
"hello world" prototype.  If it's developed more, it might be worth
another look.

>> However, one of the biggest challenges to using Nutch is the fact
>> that the URL is used as the unique key for a document.
>
> Indeed, this change is something that I've been considering, too - 
> URL==page doesn't work that well in case of archives, but also when
> your unit of information is smaller (pagelet) or larger (compound
> docs) than a page.
>
> People can help with this by working on a patch that replaces this
> silent assumption with an explicit API, i.e. splitting recordId and
> URL into separate fields.

Patches are always welcome; it is an open source package after all :)
I'll see about creating a patch-set for the changes I've made in NutchWAX.
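
As a rough illustration of the recordId/URL split discussed above, here
is a minimal Java sketch. It is not taken from Nutch or NutchWAX: the
class name DocumentKey, the forCapture helper, and the idea of building
the record id from the URL plus a capture timestamp are all assumptions
made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch only: not part of the Nutch or NutchWAX API.
final class DocumentKey {
    private final String recordId; // unique key for the stored record
    private final String url;      // where the content came from; need not be unique

    DocumentKey(String recordId, String url) {
        this.recordId = Objects.requireNonNull(recordId);
        this.url = Objects.requireNonNull(url);
    }

    // Assumption: an archive tells captures of one URL apart by capture timestamp.
    static DocumentKey forCapture(String url, String captureTimestamp) {
        return new DocumentKey(url + "@" + captureTimestamp, url);
    }

    String recordId() { return recordId; }
    String url() { return url; }

    // Identity is decided by recordId alone, never by the URL.
    @Override public boolean equals(Object o) {
        return o instanceof DocumentKey && ((DocumentKey) o).recordId.equals(recordId);
    }

    @Override public int hashCode() { return recordId.hashCode(); }

    public static void main(String[] args) {
        Map<DocumentKey, String> store = new HashMap<>();
        store.put(forCapture("http://example.org/", "20080101000000"), "first capture");
        store.put(forCapture("http://example.org/", "20090101000000"), "second capture");
        System.out.println(store.size()); // prints 2: same URL, two distinct records
    }
}
```

With the URL as the key, the second put would overwrite the first; with
a separate record id, both captures survive.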

>> As for the future of Nutch, I am concerned over what I see to be an
>> increasing focus on crawling and fetching.  We have only lightly
>> evaluated other Open Source search projects, such as Solr, and are not
>> convinced any can be a drop-in replacement for Nutch.  Solr certainly
>> has some nice features, but I'm just not convinced it can
>> scale up to the billion document level.
>
> What do you see as the unique strength of Nutch, then? IMHO there are
> existing frameworks for distributed indexing (on Hadoop) and
> distributed search (e.g. Katta). We would like to avoid the
> duplication of effort, and to focus instead on the aspects of Nutch
> functionality that are not available elsewhere.

Right now, the unique strength of Nutch -- to my organization -- is that
it has all the requisite pieces and comes closer to a complete solution
than other open source projects.  What features it lacks compared to
others are less important than the ones it has that others do not.

Two key features of Nutch indexing are the content parsing and the link
extraction.  The parsing plugins seem to work well enough, although
easier modification of content tokenizing and stop-list management would
be nice.  For example, using a config file to tweak the tokenizing for,
say, French or Spanish would be nicer than having to write a new .jj file
and do a custom build.
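
To make the config-file idea concrete, here is a minimal Java sketch of
per-language stop lists loaded from a properties-style file. It only
covers stop-list management, not the tokenizer grammar itself, and it is
not an existing Nutch feature: the StopLists class and the
"stopwords.<lang>" key convention are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch only: not part of Nutch.
final class StopLists {
    private static final String PREFIX = "stopwords."; // assumed key convention
    private final Map<String, Set<String>> byLanguage = new HashMap<>();

    // Reads lines such as "stopwords.fr = le, la, les" from a properties source.
    static StopLists load(Reader config) {
        Properties props = new Properties();
        try {
            props.load(config);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StopLists lists = new StopLists();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith(PREFIX)) continue;
            Set<String> words = new HashSet<>();
            for (String word : props.getProperty(key).split(",")) {
                String trimmed = word.trim().toLowerCase(Locale.ROOT);
                if (!trimmed.isEmpty()) words.add(trimmed);
            }
            lists.byLanguage.put(key.substring(PREFIX.length()), words);
        }
        return lists;
    }

    // Drops tokens found in the stop list for the given language code.
    List<String> filter(String lang, List<String> tokens) {
        Set<String> stop = byLanguage.getOrDefault(lang, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token.toLowerCase(Locale.ROOT))) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        StopLists lists = load(new StringReader(
                "stopwords.fr = le, la, les\nstopwords.es = el, la, los\n"));
        System.out.println(lists.filter("fr", Arrays.asList("le", "chat", "noir")));
    }
}
```

Adding or changing a language then means editing one line of
configuration rather than rebuilding.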

Along the same lines, language-awareness would have to be included in
the query processing as well.  And speaking of which, the way in which
Nutch query processing is optimized for web search makes sense.  I've
read that Solr can be configured to emulate the Nutch query processing.
If so, it would eliminate a competitive advantage of Nutch.

Nutch's summary/snippet generation approach works fine.  It's not clear
to me how this is done with the other tools.

On the search service side of things, Nutch is adequate, but I would
like to investigate other distributed search systems.  My main complaint
about Nutch's implementation is the use of the Hadoop RPC mechanism.
It's very difficult to diagnose and debug problems.  I'd prefer if the
master just talked to the slaves over OpenSearch or a simple HTTP/JSON
interface.  This way, monitoring tools could easily ping the slaves and
check for sensible results.
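
As a sketch of what such a simple HTTP/JSON check could look like, the
Java example below starts a tiny status endpoint and pings it using only
classes shipped with the JDK. Nothing here is existing Nutch code: the
SlavePing class, the /ping path, and the JSON fields "status" and
"shards" are assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only: not how Nutch search slaves communicate today.
final class SlavePing {
    // Starts a status endpoint on an ephemeral loopback port (port 0).
    static HttpServer startSlave(int shardCount) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/ping", exchange -> {
                byte[] body = ("{\"status\":\"ok\",\"shards\":" + shardCount + "}")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What a monitoring tool would do: GET /ping and return the JSON body.
    static String ping(int port) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://127.0.0.1:" + port + "/ping").openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            if (conn.getResponseCode() != 200) {
                throw new IOException("unexpected status " + conn.getResponseCode());
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[1024];
                int n;
                while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer slave = startSlave(3);
        try {
            System.out.println(ping(slave.getAddress().getPort()));
        } finally {
            slave.stop(0);
        }
    }
}
```

Because the response is plain text over HTTP, the same check works from
curl or any off-the-shelf monitoring tool.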

Along the same diagnosis/debug lines, I've added more log messages to
the start-up code of the search slave.  Without these, it's very
difficult to diagnose some trivial mistake in the deployment of the
index/segment shards, such as a mis-named directory or the like.
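
The kind of start-up check described above might look like the Java
sketch below. It is not the actual change made to the search slave: the
ShardCheck class is hypothetical, and the expected subdirectory names
("index" and "segments") are an assumption about the shard layout.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch only: not the logging added to the Nutch search slave.
final class ShardCheck {
    private static final Logger LOG = Logger.getLogger(ShardCheck.class.getName());

    // Assumption: each shard directory should contain these subdirectories.
    private static final String[] EXPECTED = {"index", "segments"};

    // Logs and returns every layout problem found, so a mis-named
    // directory shows up at start-up instead of as empty search results.
    static List<String> validate(File shardDir) {
        List<String> problems = new ArrayList<>();
        if (!shardDir.isDirectory()) {
            problems.add("shard directory not found: " + shardDir.getAbsolutePath());
        } else {
            for (String name : EXPECTED) {
                if (!new File(shardDir, name).isDirectory()) {
                    problems.add("missing subdirectory '" + name + "' in "
                            + shardDir.getAbsolutePath());
                }
            }
        }
        for (String problem : problems) {
            LOG.warning(problem);
        }
        if (problems.isEmpty()) {
            LOG.info("shard looks complete: " + shardDir.getAbsolutePath());
        }
        return problems;
    }

    public static void main(String[] args) {
        File shardDir = new File(args.length > 0 ? args[0] : ".");
        validate(shardDir);
    }
}
```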

Lastly, there's also the fact that Nutch is a known quantity and we've
already put non-trivial effort into using and adapting it to our needs.
It would be difficult to start all over again with another toolset, or
assemblage of tools.  We also have scaling expectations based on what
we've achieved so far with Nutch(WAX).  It would be painful to invest
the time and effort in, say, Solr only to discover it can't scale to the
same size with the same hardware.


Right now, the most interesting other project for us to consider is
Solr.  There seems to be more and more momentum behind it and it does
have some neat features, such as the "did you mean?" suggestions.
However, the distributed search functionality is pretty rudimentary IMO
and I am concerned about reports that it doesn't scale beyond a few
million or tens of millions of documents.  It appears, though, that some
of this has to do with the modify/update capabilities, which are
mitigated by the use of read-only IndexReaders (or something like that).


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aaron@archive.org
