mahout-user mailing list archives

From Dan Brickley <>
Subject Re: Huge classification engine
Date Fri, 01 Apr 2011 08:24:15 GMT
On 1 April 2011 10:00, vineet yadav <> wrote:
> Hi,
> I suggest you use MapReduce with a crawler architecture for crawling
> the local file system, since parsing HTML pages adds significant
> overhead.

Apache Nutch being the obvious choice there.

I'd love to see some recipes documented that show Nutch and Mahout
combined: for example, crawling some site(s), classifying the pages,
and making the results available in Lucene/Solr for search and other
apps. There looks to be a good starting point for the Nutch side, but
I'm unsure of the hooks / workflow for Mahout integration.
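To make the scenario concrete, here is a rough sketch of the Nutch
side using the Nutch 1.x command-line tools. The directory names and
the Solr URL are assumptions for illustration; adjust them for your
install, and the Mahout step would still need to be wired in after
indexing:

```shell
# Hypothetical crawl recipe (Nutch 1.x era commands; paths assumed)
bin/nutch inject crawl/crawldb urls              # seed from a urls/ dir
bin/nutch generate crawl/crawldb crawl/segments  # pick pages to fetch
s=$(ls -d crawl/segments/* | tail -1)            # newest segment
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
# Push parsed pages into Solr; Mahout-assigned category labels could
# later be added to these documents as extra fields.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -dir crawl/segments
```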

Regarding training data for categorisation that targets Wikipedia
categories, you can always pull in the textual content of *external*
links referenced from Wikipedia. For this kind of app you can probably
use the extractions from the DBpedia project; see the various download
files (you'll want at least the 'external links' file, and perhaps
'homepages' and others too). The category information is also
extracted there: see the "article categories", "category labels", and
"categories (skos)" downloads. The latter gives some hierarchy, which
might be useful for filtering out noise like admin categories or those
that are absurdly detailed or general.
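As a sketch of what that filtering might look like: the "categories
(skos)" dump is N-Triples, so you can pull out skos:broader links and
drop administrative categories with a crude name test. The sample
triples and the is_admin heuristic below are invented for
illustration, not taken from the actual dump:

```python
# Hypothetical sketch: extract category hierarchy (skos:broader) from
# DBpedia-style N-Triples and filter out admin-ish categories.

SKOS_BROADER = "<http://www.w3.org/2004/02/skos/core#broader>"

sample_triples = [
    '<http://dbpedia.org/resource/Category:Machine_learning> '
    '<http://www.w3.org/2004/02/skos/core#broader> '
    '<http://dbpedia.org/resource/Category:Artificial_intelligence> .',
    '<http://dbpedia.org/resource/Category:Wikipedia_stubs> '
    '<http://www.w3.org/2004/02/skos/core#broader> '
    '<http://dbpedia.org/resource/Category:Wikipedia_administration> .',
]

def parse_triple(line):
    """Split one N-Triples line into (subject, predicate, object)."""
    s, p, o = line.rstrip(' .\n').split(' ', 2)
    return s, p, o

def is_admin(category_uri):
    """Crude noise filter: skip Wikipedia-administration categories."""
    name = category_uri.rsplit(':', 1)[-1]
    return 'Wikipedia' in name or 'stub' in name.lower()

broader = {}
for line in sample_triples:
    s, p, o = parse_triple(line)
    if p == SKOS_BROADER and not (is_admin(s) or is_admin(o)):
        broader.setdefault(s, []).append(o)

print(broader)
```

Walking up the broader links from each category would then let you
prune whole admin subtrees, not just individually matching names.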

Another source of indicative text is to cross-reference these
categories to DMoz via common URLs. I started investigating that using
Pig, which I should either finish or write up. But Wikipedia's
'external links' plus the category hierarchy info should be a good
place to start, I'd guess.
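The cross-referencing idea is essentially an equi-join on URL, which
is also how the Pig version would express it. A minimal sketch, with
all URLs, categories, and DMoz topics invented for illustration:

```python
# Hypothetical sketch: join Wikipedia categories to DMoz topics
# through URLs that occur in both datasets.
from collections import defaultdict

# URL -> Wikipedia categories (as would come from joining the
# 'external links' and 'article categories' extractions)
wiki_urls = {
    'http://example.org/ml-intro': {'Category:Machine_learning'},
    'http://example.org/cooking': {'Category:Cuisine'},
}

# URL -> DMoz topic path
dmoz_urls = {
    'http://example.org/ml-intro': 'Top/Computers/Artificial_Intelligence',
    'http://example.org/gardening': 'Top/Home/Gardening',
}

# Equi-join on the URLs common to both sides
category_to_topics = defaultdict(set)
for url in wiki_urls.keys() & dmoz_urls.keys():
    for cat in wiki_urls[url]:
        category_to_topics[cat].add(dmoz_urls[url])

print(dict(category_to_topics))
```

Categories that land on the same DMoz topic could then share training
text pulled from either side of the join.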


