nutch-dev mailing list archives

From "Jeff Maki" <crimesagainstlo...@gmail.com>
Subject Labeling URLs a-la Google
Date Thu, 06 Sep 2007 20:04:18 GMT
Hello everybody,

I'm working on a project that is essentially a searchable database for
academic citations at the University of Pittsburgh. One of our
searching requirements was to be able to break the search results into
sections--in order to do this, I implemented something similar to
Google's "labels".

It's based heavily on the example plugin, and maybe not so pretty
code-wise, but it's a start.

Downloadable here:
http://upclose.lrdc.pitt.edu/people/maki_assets/nutch-regex-label.tar.gz

You configure it by adding something like the below to your nutch-site.xml file:

<property>
  <name>extension.regexlabeler.labels</name>
  <value>
    http://dev3\.informalscience\.org/research.*\.php.* = firsttag secondtag thirdtag,
    http://dev3\.informalscience\.org/project.*\.php.* = project,
    http://www.?\.informalscience\.org.* = oldsite,
    http://dev3\.informalscience\.org.* = devsite
  </value>
</property>

Notes:
* Format of each entry is <regular expression> = <labels, space delimited>; entries are separated by commas.
* URLs must be unique.
* Multiple tags for the same pattern are delimited by a space.
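
To make the format in the notes concrete, here is a minimal Python sketch of how such a value could be parsed and applied to a URL. It is an illustration only, not the plugin's code. The function names (parse_label_rules, labels_for) are hypothetical, and three behaviours are assumptions rather than anything stated above: a pattern must match the whole URL, labels from every matching pattern are collected, and each entry is split at its last "=" so that patterns containing "=" still parse. Python's regex syntax is also not identical to Java's, although the simple patterns in this example behave the same in both.

```python
import re

def parse_label_rules(value):
    """Parse comma-separated '<regex> = <labels>' entries into (pattern, labels) pairs."""
    rules = []
    for entry in value.split(","):
        # Collapse indentation and wrapped lines into single spaces.
        entry = " ".join(entry.split())
        if not entry:
            continue
        # Assumption: split at the LAST '=' so a pattern may itself contain '='.
        pattern, sep, labels = entry.rpartition("=")
        if not sep:
            raise ValueError("entry has no '=': %r" % entry)
        rules.append((re.compile(pattern.strip()), labels.split()))
    return rules

def labels_for(url, rules):
    """Return the labels of every rule whose pattern matches the whole URL."""
    found = []
    for pattern, labels in rules:
        # Assumption: full-URL match, and all matching rules contribute labels.
        if pattern.fullmatch(url):
            for label in labels:
                if label not in found:
                    found.append(label)
    return found

# The example value from the configuration above, including the wrapped line.
EXAMPLE_VALUE = r"""
    http://dev3\.informalscience\.org/research.*\.php.* = firsttag
secondtag thirdtag,
    http://dev3\.informalscience\.org/project.*\.php.* = project,
    http://www.?\.informalscience\.org.* = oldsite,
    http://dev3\.informalscience\.org.* = devsite
"""

rules = parse_label_rules(EXAMPLE_VALUE)
print(labels_for("http://dev3.informalscience.org/research/view.php?id=1", rules))
```

Under these assumptions, a URL on the dev3 host picks up the labels of the specific pattern as well as the catch-all "devsite" pattern, so the order of entries only affects the order of the returned labels.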

Hope this saves somebody some time,

-Jeff

(BTW, Nutch has worked very well for us--excellent project!)
