mahout-user mailing list archives

From "Lukas Vlcek" <>
Subject Re: Clustering Demo
Date Sat, 17 May 2008 18:30:28 GMT

If you also want to target other tasks (not just document classification),
then this could be helpful:

Some of those data sets are reasonably small, so they could be
integrated into Mahout unit tests by default (sounds like a crazy idea?).


On Sat, May 17, 2008 at 1:39 PM, Grant Ingersoll wrote:

> On May 12, 2008, at 11:24 AM, Karl Wettin wrote:
>> Andrzej Bialecki wrote:
>>> Grant Ingersoll wrote:
>>>> Anyone have any sample code or demo of running the clustering over a
>>>> large collection of documents that they could share?  Mainly looking for
>>>> an example of taking some corpus, converting it into the appropriate Mahout
>>>> representation and then running either the k-means or the canopy clustering
>>>> on it.
>>> It would be way cool to do this with the industry standard 20 newsgroups
>>> corpus - there have been many experiments and evaluations of this corpus, so
>>> it's good as a baseline.
>> What is the result we hope to show when clustering that data set?
> My goal is simply to have examples that people can read and try out that
> produce reasonable results.  I don't need fancy visualizations or anything,
> just code that goes from raw text input, to nice clusters.
>> Either way it feels like we would have to prepare it specifically for each
>> thing we want to demonstrate. We can't just throw the data in there and
>> expect to get something smart as a response without knowing what we are
>> looking for. That's when you get answers like "almost all customers either
>> buy a paper or a plastic bag".
>> Do we want to see 20 clusters perfectly aligned to the newsgroup classes?
>> Isn't that more of a classification problem than a clustering problem? So
>> what do we want to demonstrate? Similar threads, messages, authors?
> I'd say similar threads/messages, I guess.
>> In what way can we prepare it so that it makes sense for as many of the
>> things we might want to show off as possible? What class fields can we extract from
>> the headers except for author and thread identity? How do we want to
>> tokenize the text (grams of words and characters, stemming, stopwords,
>> etc.), do we want to separate quotation from author text so we can use
>> different weights for quotation, etc.?
> Let's just start simple with words and then enhance.
> -Grant
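
As a rough illustration of the "start simple with words" idea above, here is a minimal sketch in plain Python. It is not Mahout code and does not use Mahout's API; the helper names (tokenize, term_frequencies, to_vector) and the two sample sentences are made up for the example. It lowercases raw text, splits it into word tokens, and builds term-frequency vectors of the kind that could later be fed to a k-means or canopy clustering step.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; none of this is Mahout's API.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def term_frequencies(text):
    """Count how often each word token occurs in the text."""
    return Counter(tokenize(text))


def to_vector(counts, vocabulary):
    """Lay the term counts out in a fixed vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]


# Made-up sample documents standing in for newsgroup messages.
docs = ["The cat sat on the mat.", "The dog sat."]
counts = [term_frequencies(d) for d in docs]

# Shared vocabulary: every distinct token, sorted for a stable order.
vocabulary = sorted(set().union(*counts))
vectors = [to_vector(c, vocabulary) for c in counts]
```

Stemming, stopword removal, n-grams, or separate weights for quoted text would each slot in as an extra step between tokenize and term_frequencies, which is one reason to begin with plain words and enhance later.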

