nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mattmann, Chris A" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: The Future of Nutch, reactivated
Date Thu, 14 May 2009 20:43:10 GMT
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
are insulated from the underlying messaging substrate (e.g., crawl over JMS,
EJB, Hadoop, RMI, etc., crawl using Heretix, parse using Tika or some other
framework, etc.)
* simpler Nutch deployment mechanisms (separate Nutch deployment package
from source code package), think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris


On 5/14/09 6:45 AM, "Andrzej Bialecki" <ab@getopt.org> wrote:

> Hi all,
> 
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before, I'm trying here to express this as it looks from my point
> of view.
> 
> Target audience
> ===============
> I think that the Nutch project experiences a crisis of personality now -
> we are not sure what is the target audience, and we cannot satisfy
> everyone. I think that there are following groups of Nutch users:
> 
> 1. Large-scale Internet crawl & search: actually, there are only few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention
> are the chief concerns here.
> 
> 2. Medium-scale vertical search: I suspect that many Nutch users fall
> into this category. Modularity, flexibility in implementing custom
> processing, ability to modify workflows and to use only some Nutch
> components seem to be chief concerns here. Scalability too, but only up
> to a volume of ~100-200 mln documents.
> 
> 3. Small- to medium-scale enterprise search: there's a sizeable number
> of Nutch users that fall into this category, for historical reasons.
> Link-based ranking and resource discovery are not that important here,
> but integration with Windows networking, Microsoft formats and databases
> , as well as realtime indexing and easy index maintenance are crucial.
> This class of users often has to heavily customize Nutch to get any
> sensible result. Also, this is where Solr really shines, so there is
> little benefit in using Nutch here. I predict that Nutch will have fewer
> and fewer users of this type.
> 
> 4. Single desktop to small intranet search: as above, but the accent is
> on the ease of use out of the box, and an often requested feature is a
> GUI frontend. Currently IMHO Nutch is too complex and requires too much
> command-line operation for casual users to make this use case attractive.
> 
> What is the target audience that we as a community want to support? By
> this I mean not only the moral support, but also active participation in
> the development process. From the place where we are at the moment we
> could go in any of the above directions.
> 
> Core competence
> ===============
> This is a simple but important point. Currently we maintain several
> major subsystems in Nutch that are implemented by other projects, and
> often in a better way. Plugin framework (and dependency injection) and
> content parsing are two areas that we have to delegate to third-party
> libraries, such as Tika and OSGI or some other simple IOC container -
> probably there are other components that we don't have to do ourselves.
> Another thing that I'd love to delegate is the distributed search and
> index maintenance - either through Solr or Katta or something else.
> 
> The question then is, what is the core competence of this project? I see
> the following major areas that are unique to Nutch:
> 
> * crawling - this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and netiquette
> constraints, etc.
> 
> * web graph analysis - this includes link-based ranking, mirror
> detection (and URL "aliasing") but also link spam detection and a more
> complex control over the crawling frontier.
> 
> Anything more? I'm not sure - perhaps I would add template detection and
> pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
> 
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and pluggable FetchSchedule and Signature. A lot
> remains to be done here, and we are still spending a lot of resources on
> dealing with issues outside this core competence.
> 
> -------
> 
> So, what do we need to do next?
> 
> * we need to decide where we should commit our resources, as a community
> of users, contributors and committers, so that the project is most
> useful to our target audience. At this point there are few active
> committers, so I don't think we can cover more than 1 direction at a
> time ... ;)
> 
> * we need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
> 
> Feel free to comment on the above, make suggestions or corrections. I'd
> like to wrap it up in a concise mission statement that would help us set
> the goals for the next couple months.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




Mime
View raw message