Alexander,

can you give me some advice?

I want to integrate Nutch and Mahout to classify crawled pages.

1st question: Has anyone tried this, and are there any libraries available?

https://github.com/DigitalPebble/behemoth could be used to build a Nutch -> Behemoth -> Mahout pipeline. The only problem is that there is no standard format for the Mahout classifiers, so you would need to write a bit of code for it. There is also a Solr plugin in Behemoth.

Alternatively, you can use our Text Classification API (https://github.com/DigitalPebble/TextClassification) within a Nutch indexing filter.
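
As a rough illustration of that second option, here is a minimal, self-contained sketch of what such a filter does: take the parsed text of a page, run it through a classifier, and attach the predicted label as an extra field before indexing. It deliberately does not use the real Nutch or TextClassification classes. PageClassifier, KeywordClassifier, ClassifyingFilterSketch and the "label" field name are all hypothetical names made up for this example, and the keyword matcher is only a toy stand-in for a trained model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the real classifier (a trained model in practice). */
interface PageClassifier {
    String classify(String text);
}

/** Toy classifier: returns the label of the first keyword found in the text. */
class KeywordClassifier implements PageClassifier {
    private final Map<String, String> keywordToLabel; // keywords must be lower case
    private final String defaultLabel;

    KeywordClassifier(Map<String, String> keywordToLabel, String defaultLabel) {
        this.keywordToLabel = keywordToLabel;
        this.defaultLabel = defaultLabel;
    }

    @Override
    public String classify(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : keywordToLabel.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return defaultLabel;
    }
}

/**
 * Plays the role an indexing filter would play: it receives the fields
 * collected so far plus the parsed text, and adds a classification field.
 * The real Nutch interface is different; a plain Map stands in for the document.
 */
class ClassifyingFilterSketch {
    static final String LABEL_FIELD = "label"; // hypothetical field name

    private final PageClassifier classifier;

    ClassifyingFilterSketch(PageClassifier classifier) {
        this.classifier = classifier;
    }

    Map<String, String> filter(Map<String, String> fields, String parsedText) {
        // Copy the input so the caller's map is left untouched.
        Map<String, String> enriched = new HashMap<>(fields);
        enriched.put(LABEL_FIELD, classifier.classify(parsedText));
        return enriched;
    }
}
```

In a real plugin the filter body would stay about this small; the work is in loading the trained model once at start-up and mapping the page text to whatever input the classifier expects.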
 

Next: which is better/easier, improving Nutch by injecting a Mahout classifier into the project, or improving Mahout by adding the ability to read and write Nutch files?

It depends on what you need to do with the data after classification. Behemoth already does the conversion from Nutch to Mahout, but again the problem is the lack of a standard format on the Mahout side.

HTH

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble