nutch-dev mailing list archives

From "Jeff Maki" <>
Subject Labeling URLs a-la Google
Date Thu, 06 Sep 2007 20:04:18 GMT
Hello everybody,

I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our
searching requirements was to be able to break the search results into
sections--in order to do this, I implemented something similar to
Google's "labels".

It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.

Downloadable here:

You configure it by adding something like the below to your nutch-site.xml file:

    http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
    http://dev3\.informalscience\.org/project.*\.php.* = project,
    http://www.?\.informalscience\.org.* = oldsite,
    http://dev3\.informalscience\.org.* = devsite

* Each entry has the form <regular expression> = <labels, space delimited>, and entries are separated by commas.
* URL patterns must be unique.
* Multiple tags for the same pattern are delimited by a space.
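
For anyone who wants to check a rule list before dropping it into nutch-site.xml, here is a minimal sketch of how entries in this format could be interpreted. It is not the plugin's code: it is written in Python rather than Java, the names RULES_TEXT, parse_rules and labels_for are made up for illustration, and it assumes that a pattern must match the whole URL and that labels from every matching pattern are combined. The plugin itself may behave differently on both points.

```python
import re

# Sample rules copied from the configuration example in this message.
RULES_TEXT = r"""
http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
http://dev3\.informalscience\.org/project.*\.php.* = project,
http://www.?\.informalscience\.org.* = oldsite,
http://dev3\.informalscience\.org.* = devsite
"""


def parse_rules(text):
    """Split comma-separated entries into (compiled_regex, [labels]) pairs."""
    rules = []
    seen = set()
    # Limitation of this sketch: a pattern that itself contains a comma
    # (for example a{1,3}) would be split in the wrong place.
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Assumption: split on the last '=' so that a pattern may contain '='
        # (e.g. in a query string) while labels may not.
        pattern, sep, labels = entry.rpartition("=")
        if not sep:
            raise ValueError(f"missing '=' in entry: {entry!r}")
        pattern = pattern.strip()
        if pattern in seen:
            raise ValueError(f"duplicate pattern: {pattern!r}")
        seen.add(pattern)
        rules.append((re.compile(pattern), labels.split()))
    return rules


def labels_for(url, rules):
    """Return the labels of every rule whose pattern matches the whole URL."""
    found = []
    for regex, labels in rules:
        # Assumption: full-string match, not a substring search.
        if regex.fullmatch(url):
            for label in labels:
                if label not in found:
                    found.append(label)
    return found


rules = parse_rules(RULES_TEXT)
print(labels_for("http://dev3.informalscience.org/research/list.php?id=1", rules))
```

Under those assumptions the sample URL picks up firsttag, secondtag and thirdtag from the first pattern and devsite from the last one, because the broad dev3 pattern also matches it. If only one set of labels per URL is wanted, the patterns would need to be made mutually exclusive.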

Hope this saves somebody some time,


(BTW, Nutch has worked very well for us--excellent project!)
