mahout-user mailing list archives

From Chris Harrington <ch...@heystaks.com>
Subject Re: Does something like an "explain" feature exist in Mahout for clustering.
Date Fri, 08 Feb 2013 17:50:54 GMT
I found this on Stack Overflow, which helped a lot: http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper

Since the link above let me get a map of file names to clusters, I was able to build
something that outputs various interesting things, such as a map of categories to clusters
(the test data was labeled) and the percentage of each category's docs that ended up in each
cluster (e.g. 23% of category B ended up in cluster 2).
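
A minimal sketch of that percentage calculation, in Python rather than Mahout code. It assumes
you already have two dicts, one mapping file name to cluster id and one mapping file name to
category label; the function and parameter names are hypothetical, not part of Mahout.

```python
from collections import Counter, defaultdict


def category_cluster_percentages(doc_to_cluster, doc_to_category):
    """Return {category: {cluster_id: percent of that category's docs}}.

    Both arguments are assumed to be plain dicts keyed by file name.
    """
    counts = defaultdict(Counter)
    for doc, cluster in doc_to_cluster.items():
        category = doc_to_category.get(doc)
        if category is None:
            continue  # skip documents that have no label
        counts[category][cluster] += 1

    result = {}
    for category, per_cluster in counts.items():
        total = sum(per_cluster.values())
        result[category] = {
            cluster: 100.0 * n / total for cluster, n in per_cluster.items()
        }
    return result
```

Sorting each category's entries by ascending percentage makes the small, suspicious
assignments easy to spot.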

Then, using this same info, I created a directory per category containing one text file per
cluster, each holding the content of the text files that were clustered into that cluster.
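
One way such a layout could be written out, again as a hedged Python sketch. The layout
`<out_dir>/<category>/cluster_<id>.txt` and all names here are assumptions made for
illustration; the document texts are assumed to be available in a dict keyed by file name.

```python
from collections import defaultdict
from pathlib import Path


def write_cluster_files(out_dir, doc_to_cluster, doc_to_category, doc_to_text):
    """Write <out_dir>/<category>/cluster_<id>.txt and return the paths written."""
    grouped = defaultdict(list)
    for doc, cluster in sorted(doc_to_cluster.items()):
        category = doc_to_category.get(doc)
        if category is None or doc not in doc_to_text:
            continue  # skip unlabeled docs and docs whose text is missing
        grouped[(category, cluster)].append(doc_to_text[doc])

    written = []
    for (category, cluster), texts in grouped.items():
        path = Path(out_dir) / str(category) / f"cluster_{cluster}.txt"
        path.parent.mkdir(parents=True, exist_ok=True)
        # Separate documents with a blank line so they stay readable.
        path.write_text("\n\n".join(texts), encoding="utf-8")
        written.append(path)
    return written
```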

So, for each category, I checked where the small percentages of its docs went (e.g. 0.85% of
category B ended up in cluster 4) and compared the text of those docs against the top 50
keywords from the clusterdumper utility. This showed that at least one top keyword was
matching and causing the strange clustering I was seeing.
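
The keyword check can be sketched roughly as follows. This is a Python illustration, not
Mahout code: the function name is hypothetical, the top keywords are assumed to have been
copied by hand from the clusterdumper output, and the tokenizer is a simple lowercase
alphanumeric split that will not exactly match whatever analyzer produced the vectors.

```python
import re


def matching_keywords(text, top_keywords):
    """Return the cluster's top keywords that also occur as tokens in the text."""
    # Naive tokenizer (an assumption): lowercase runs of letters and digits.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [kw for kw in top_keywords if kw.lower() in tokens]
```

Running this over each out-of-place document with the top keywords of the cluster it landed
in shows which terms are likely pulling it there.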

Hopefully the above will be of help to someone else.



On 5 Feb 2013, at 18:43, Chris Harrington wrote:

> I'm currently using KMeans with canopy and Cosine as the measure. The data I'm using
> has been somewhat curated into categories, so I expected them to cluster alongside the other
> documents in their respective categories. Some of them fall nicely into clusters I'd expect,
> but others are like the examples I gave in the first mail. I suspect some of the oddities
> are due to noise in the data (of which there is a considerable amount, e.g. documents with
> only 2 words).
> 
> 
> On 4 Feb 2013, at 22:28, Jeff Eastman wrote:
> 
>> That's a really good question. Mahout does not have an "explain" feature; however,
>> you can use the ClusterDumper to print out the cluster centers and vectors clustered within
>> each cluster. Output is pretty verbose and, with large text vectors being truncated, might
>> not be that useful. You might need to write something to do this. Look at the cluster evaluator
>> tests for some hints.
>> 
>> Which algorithm were you using?
>> 
>> On 2/4/13 1:57 PM, Chris Harrington wrote:
>>> I was wondering if there was an explain feature in Mahout, something that gives
>>> the reason why it did what it did, shows the values of the various features it used to evaluate
>>> and choose the result, etc.
>>> 
>>> Because I have some wildly different text data being clustered together, for
>>> example it clustered these 2 together and I'd like to be able to figure out why
>>> 
>>> Text 1: "Iron Butterfly Bassist Lee Dorman Dies at 70"
>>> 
>>> Text 2: "The BEST Memes Of 2012 2012 was a landmark year for memes -- and we
>>> could say that due to the Ikea Monkey alone -- but it's not always easy…"
>>> 
>> 
> 

