mahout-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Clustering Demo
Date Mon, 12 May 2008 15:24:07 GMT
Andrzej Bialecki wrote:
> Grant Ingersoll wrote:
>> Anyone have any sample code or demo of running the clustering over a 
>> large collection of documents that they could share?  Mainly looking 
>> for an example of taking some corpus, converting it into the 
>> appropriate Mahout representation and then running either the k-means 
>> or the canopy clustering on it.
> 
> It would be way cool to do this with the industry standard 20 newsgroups 
> corpus - there have been many experiments and evaluations of this 
> corpus, so it's good as a baseline.

What is the result we hope to show when clustering that data set?

Either way, it feels like we would have to prepare it specifically for 
each thing we want to demonstrate. We can't just throw the data in there 
and expect to get something smart as a response without knowing what we 
are looking for. That's when you get answers like "almost all customers 
buy either a paper or a plastic bag".

Do we want to see 20 clusters perfectly aligned to the newsgroup 
classes? Isn't that more of a classification problem than a clustering 
problem? So what do we want to demonstrate? Similar threads, messages, 
authors?

How can we prepare it so that it makes sense for as many of the things 
we might want to show off as possible? What class fields can we extract 
from the headers besides author and thread identity? How do we want 
to tokenize the text (word and character n-grams, stemming, 
stopwords, etc.)? Do we want to separate quotations from the author's 
own text so that we can give quotations a different weight, etc.?
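
To make the preparation questions concrete, here is a rough sketch in 
Python of one possible approach. It is not Mahout code, and every name in 
it (split_quoted, word_ngrams, message_features, quote_weight) is made up 
for illustration. It assumes quoted lines start with ">", it uses plain 
lowercase word n-grams with no stemming, and it treats attribution lines 
such as "X wrote:" as the author's own text.

```python
import re
from collections import Counter
from email import message_from_string


def split_quoted(body):
    """Separate lines quoted with '>' from the author's own lines."""
    own, quoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop all leading quote markers, including nested ones (">> ").
            quoted.append(stripped.lstrip("> "))
        else:
            own.append(line)
    return "\n".join(own), "\n".join(quoted)


def word_ngrams(text, n=1, stopwords=frozenset()):
    """Lowercase the text, drop stopwords, and return word n-grams."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in stopwords]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def message_features(raw, quote_weight=0.5, n=1, stopwords=frozenset()):
    """Return (header fields, term weights) for one single-part message.

    quote_weight is a hypothetical knob: each term occurrence in quoted
    text counts this much, while the author's own text counts 1.0.
    """
    msg = message_from_string(raw)
    own, quoted = split_quoted(msg.get_payload())

    weights = Counter()
    for term in word_ngrams(own, n, stopwords):
        weights[term] += 1.0
    for term in word_ngrams(quoted, n, stopwords):
        weights[term] += quote_weight

    # Candidate class fields; References is one way to get thread identity.
    fields = {
        "author": msg.get("From"),
        "subject": msg.get("Subject"),
        "references": msg.get("References"),
    }
    return fields, dict(weights)
```

With something like this, the quotation weight, the n-gram size and the 
stopword list become parameters we can vary per demonstration instead of 
decisions baked into the prepared data set.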



        karl