mahout-user mailing list archives

From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Re: Clustering Demo
Date Sat, 17 May 2008 18:30:28 GMT
Hi,

If you also want to target other tasks (not just document classification),
then this could be helpful:
http://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=taskUp&view=table

Some of those data sets are small enough that they could be
integrated into the Mahout unit tests by default (sounds like a crazy idea?).

Lukas

On Sat, May 17, 2008 at 1:39 PM, Grant Ingersoll <gsingers@apache.org>
wrote:

>
> On May 12, 2008, at 11:24 AM, Karl Wettin wrote:
>
>> Andrzej Bialecki wrote:
>>
>>> Grant Ingersoll wrote:
>>>
>>>> Anyone have any sample code or demo of running the clustering over a
>>>> large collection of documents that they could share?  Mainly looking for an
>>>> example of taking some corpus, converting it into the appropriate Mahout
>>>> representation and then running either the k-means or the canopy clustering
>>>> on it.
>>>>
>>> It would be way cool to do this with the industry standard 20 newsgroups
>>> corpus - there have been many experiments and evaluations of this corpus, so
>>> it's good as a baseline.
>>>
>>
>> What is the result we hope to show when clustering that data set?
>>
>
> My goal is simply to have examples that people can read and try out that
> produce reasonable results.  I don't need fancy visualizations or anything,
> just code that goes from raw text input, to nice clusters.
>
>
>>
>> Either way it feels like we would have to prepare it specifically for each
>> thing we want to demonstrate. We can't just throw the data in there and
>> expect to get something smart as a response without knowing what we are
>> looking for. That's when you get answers like "almost all customers either
>> buy a paper or a plastic bag".
>>
>> Do we want to see 20 clusters perfectly aligned to the newsgroup classes?
>> Isn't that more of a classification problem than a clustering problem? So
>> what do we want to demonstrate? Similar threads, messages, authors?
>>
>
> I'd say similar threads/messages, I guess.
>
>
>>
>> In what way can we prepare so it makes as much sense for as many things as
>> possible we might want to show off? What class fields can we extract from
>> the headers except for author and thread identity? How do we want to
>> tokenize the text (grams of words and characters, stemming, stopwords,
>> etc.), do we want to separate quotation from author text so we can use
>> different weights for quotation, etc.?
>>
>
> Let's just start simple with words and then enhance.
>
> -Grant
>
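
To make the "start simple with words" idea in the quoted thread concrete, here is a minimal sketch of going from raw text to clusters. It is an illustration only and not Mahout code: the function names (tokenize, to_vector, kmeans) and the four toy documents are made up for this example, and the choice of cosine similarity on unit-length term-frequency vectors (spherical k-means) is an assumption, since the thread only names k-means and canopy clustering in general. Seeds are passed explicitly so the result is reproducible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; no stemming or stopwords yet."""
    return re.findall(r"[a-z]+", text.lower())


def to_vector(text):
    """Sparse term-frequency vector, scaled to unit length."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum of sparse vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its most similar centroid, then recompute centroids."""
    centroids = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[k] = centroid(members)
    return assignment


# Toy stand-in for a real corpus such as 20 newsgroups (hypothetical data).
docs = [
    "The hockey team won the game last night",
    "Our hockey team lost the game in overtime",
    "The space shuttle launch was delayed by NASA",
    "NASA plans another shuttle launch into space",
]
vectors = [to_vector(d) for d in docs]
labels = kmeans(vectors, seeds=(0, 2))
print(labels)  # [0, 0, 1, 1]
```

Stemming, stopword removal, n-grams, or separate weights for quoted text would all slot into tokenize and to_vector without touching the clustering loop.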



-- 
http://blog.lukas-vlcek.com/
