mahout-user mailing list archives

From Jeff Eastman
Subject Re: How to define a topic for cluster.
Date Wed, 25 Aug 2010 17:02:03 GMT
  Hi Young,

You did not mention what part(s) of Mahout you are using but I will 
assume the clustering code. LDA is designed to deduce a set of topics 
from a corpus of documents and does not require or allow the topics to 
be predefined. Some of the other clustering algorithms (e.g. k-Means, 
Fuzzy k-Means, Dirichlet) can be initialized with a set of topics 
(clusters), but after the iterations these will likely have changed 
significantly. K-Means can also be initialized by running Canopy over 
your dataset. In any case, none of the Mahout clustering algorithms 
requires hard-coding. Once you have developed a set of topics (generally an 
offline, batch process) you can use one of the clustering 
implementations to quickly cluster new documents using those topics.
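
As a rough illustration of that last step, here is a minimal sketch in 
plain Python. It is not Mahout's API: the centroids, the term weights and 
the function names are all made up for the example, and it assumes each 
topic is represented by a centroid vector of term weights and that a new 
document goes to the closest centroid by cosine distance.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def assign_to_cluster(doc, centroids):
    """Return the id of the centroid closest to the document vector."""
    return min(centroids, key=lambda cid: cosine_distance(doc, centroids[cid]))

# Hypothetical centroids, as if produced by an earlier offline batch run.
centroids = {
    "cluster-0": {"election": 0.8, "vote": 0.6, "senate": 0.4},
    "cluster-1": {"match": 0.7, "goal": 0.7, "league": 0.5},
}

# A new article, already turned into term weights (values are invented).
new_doc = {"goal": 1.0, "league": 1.0, "vote": 0.1}
print(assign_to_cluster(new_doc, centroids))
```

The centroids stay fixed here, so assigning a new document is a single 
pass over the clusters and needs no further iterations.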

Of course, if you really want to use predefined topics then you should 
look at some of the classification algorithms which can be trained to 
sort your news articles on the fly.
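
To make the contrast with clustering concrete, here is a small sketch of 
classification with predefined topics, using a toy multinomial naive 
Bayes classifier in plain Python. It does not use Mahout's classifier 
code: the topic labels, the training articles and the function names are 
invented for illustration, and a real setup would need far more training 
data.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Build per-topic word counts from (topic, words) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, words in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(words, model):
    """Return the predefined topic with the highest log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)  # topic prior
        for w in words:
            # Add-one (Laplace) smoothing so unseen words do not zero out a topic.
            score += math.log((word_counts[topic][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical labeled articles: the topics are fixed up front by a person.
training = [
    ("politics", ["election", "vote", "senate"]),
    ("sports", ["match", "goal", "league"]),
]
model = train(training)
print(classify(["goal", "vote", "league"], model))
```

Unlike the clustering case, the topic names here come from the labeled 
training data, so they never drift during training.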


On 8/25/10 9:32 AM, Young wrote:
> Hi all,
> I am using Mahout to cluster news articles, and I can see the top words for
> each cluster. But I am very keen to know how to define a topic for each
> cluster. Do we have to hard-code the topic for the cluster?
> I found an interesting site that does excellent topic clustering based on
> the page content.
> Thank you very much.
> --Young
