nutch-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: nucth and mahout integration
Date Mon, 02 Jul 2012 09:13:12 GMT
Alexander,

> can you give me some advice?
>
> I want to integrate nutch and mahout to classify crawled pages.
>
> 1st question: Has someone tried this and are there any libraries available?
>

Behemoth (https://github.com/DigitalPebble/behemoth) could be used to
build a Nutch -> Behemoth -> Mahout pipeline. The only problem is that
there is no standard format for the Mahout classifiers, so you would need
to write a bit of code for it. There is also a SOLR plugin in Behemoth.

Alternatively, you can use our Text Classification API (
https://github.com/DigitalPebble/TextClassification) within a Nutch
indexing filter.
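
To make the indexing-filter idea concrete, here is a minimal sketch in
plain Java. It is only an illustration: the TextClassifier interface, the
KeywordClassifier rule and the map-based document are hypothetical
stand-ins, not the real Nutch or TextClassification APIs. A real filter
would implement Nutch's IndexingFilter and call a trained model instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins: none of these types come from Nutch, Behemoth,
// Mahout, or the TextClassification project.
class ClassifyingFilterSketch {

    /** Hypothetical classifier contract: text in, label out. */
    interface TextClassifier {
        String classify(String text);
    }

    /** Toy keyword rule standing in for a trained model. */
    static class KeywordClassifier implements TextClassifier {
        @Override
        public String classify(String text) {
            String lower = text.toLowerCase(Locale.ROOT);
            return lower.contains("crawl") ? "search" : "other";
        }
    }

    /**
     * Mimics what an indexing filter does: take the fields of a parsed
     * page, classify its text, and return the fields with a "label" added.
     * The input map is left unmodified.
     */
    static Map<String, String> filter(Map<String, String> doc,
                                      TextClassifier classifier) {
        Map<String, String> out = new HashMap<>(doc);
        String text = doc.getOrDefault("content", "");
        out.put("label", classifier.classify(text));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("content", "Nutch can crawl and index web pages.");
        System.out.println(filter(doc, new KeywordClassifier()).get("label"));
    }
}
```

The point of the design is that the label is attached at indexing time, so
it ends up as an ordinary field in the index and can be searched or
faceted on like any other.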


>
> next: What is better/easier? Improve nutch and inject mahout classifier
> into the project OR improve mahout to add an ability to read and write
> nutch files?
>

It depends on what you need to do with the data after classification.
Behemoth already does the conversion from Nutch to Mahout, but again the
problem is the lack of a standard format on the Mahout side.

HTH

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
