nutch-dev mailing list archives

From "Dawid Weiss (JIRA)" <>
Subject [jira] Updated: (NUTCH-237) Carrot2 clustering plugin upgrade.
Date Tue, 04 Apr 2006 10:22:44 GMT

Dawid Weiss updated NUTCH-237:


Hi Andrzej. The ZIP file contains a patch and svn stat with the improved code:

- The primary language for hits without explicit langid and a list of enabled languages in
the clustering component can be specified in the configuration file (readme.txt gives the

- By default, all languages in Carrot2 (except for Polish) are enabled. English is the default
primary language.

- I removed the dependency on Neko in favor of the simpler routine we have in the Carrot2
codebase anyway. The change shouldn't affect the results (I checked on my local installation
and it seems to be fine).

I haven't played with the language identifier yet because I don't have a crawl with documents
containing langid codes. The code should work without problems though -- details.getValue("lang")
is converted to Carrot2's property RawDocument.PROPERTY_LANGUAGE and this is taken into account
when clustering.
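
To illustrate the mapping described above, here is a minimal sketch of the fallback logic, not the code from the patch. The class and method names are hypothetical, PROPERTY_LANGUAGE is a local stand-in for Carrot2's RawDocument.PROPERTY_LANGUAGE (its real value is not shown in this message), and falling back to the default for a language that is not enabled is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the langid fallback; not the plugin's actual code. */
final class LanguageFallbackSketch {

    // Local stand-in for RawDocument.PROPERTY_LANGUAGE (assumed value).
    static final String PROPERTY_LANGUAGE = "lang";

    /**
     * Picks the language code to use for one hit.
     * Hits without a langid get the configured default. Falling back to the
     * default for a language that is not enabled is an assumption as well.
     */
    static String resolveLanguage(String hitLang, String defaultLang, Set<String> enabled) {
        if (hitLang == null || hitLang.trim().isEmpty()) {
            return defaultLang;
        }
        String code = hitLang.trim().toLowerCase(Locale.ROOT);
        return enabled.contains(code) ? code : defaultLang;
    }

    /** Copies the hit's "lang" detail into a property map for the clustered document. */
    static Map<String, Object> toDocumentProperties(
            Map<String, String> hitDetails, String defaultLang, Set<String> enabled) {
        Map<String, Object> props = new HashMap<>();
        props.put(PROPERTY_LANGUAGE,
                resolveLanguage(hitDetails.get("lang"), defaultLang, enabled));
        return props;
    }
}
```

With "en" as the default and "en" and "de" enabled, a hit with no "lang" value resolves to "en", and a hit tagged "de" keeps "de".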

I couldn't delete the previously attached files. This ZIP file contains only the patch and
svn stat -- you'll have to remove a few JARs manually and replace the others with their new
counterparts from the ZIP file I attached to this issue earlier (they haven't changed). Let me
know if you need anything.

> Carrot2 clustering plugin upgrade.
> ----------------------------------
>          Key: NUTCH-237
>          URL:
>      Project: Nutch
>         Type: Improvement

>     Reporter: Dawid Weiss
>     Priority: Trivial
>  Attachments:, c2.patch,, svn-stat.txt
> This is an upgrade of the clustering plugin to the newest release (1.0.2).

This message is automatically generated by JIRA.
