mahout-user mailing list archives

From arijit <>
Subject Fw: Injecting Mahout in the nutch-solr mix
Date Sat, 27 Oct 2012 12:31:27 GMT
   As this topic concerns both nutch and Mahout, I am forwarding the request for direction
that I posted on the nutch user mailing list to this list.

----- Forwarded Message -----
From: arijit <>
To: "" <> 
Sent: Saturday, October 27, 2012 5:51 PM
Subject: Injecting Mahout in the nutch-solr mix

   I have been using nutch to crawl some wiki sites, with the following in my plugin:
   o a subclass of HtmlParseFilter to learn patterns from the crawled data, and
   o a subclass of IndexingFilter that uses what was learned in the earlier step to add
additional indexes when adding the index info into solr.

   It works. However, it means that I need to spend time writing specific code to
understand each of these classes of documents. I am looking at Mahout to help me with this
intermediate job; its clustering functionality seems well suited to clustering the crawled
pages so that I can add the specific dimensions into solr.

   Do you think this is a good way forward? Should I try to use Mahout as a library to help
me do the plugin work that I described earlier? Or is there a better way to achieve the
clustering before I add indexes into solr?

   Any help or direction on this is much appreciated.