mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Clustering techniques, tips and tricks
Date Thu, 31 Dec 2009 17:10:35 GMT
As some of you may know, I'm working on a book (it's a long time coming, but I'm getting there)
about open source techniques for working with text.  One of my chapters is on clustering and
in it, I want to talk about generic clustering approaches and then show concrete examples
of them in action.   I've got the concrete side of it down.

Based on my research, it seems people typically divide up the clustering space into two approaches:
hierarchical and flat/partitioning.  In overlaying that knowledge with what we have for techniques
in Mahout, I'm a bit stumped about where things like LDA and Dirichlet fit into those two
approaches.  Or is there perhaps a third that I'm missing?  They don't seem particularly hierarchical,
but they don't seem flat either, if that makes any sense, given the probabilistic/mixture
nature of the algorithms.  Perhaps I should forgo the traditional division that previous authors
have taken and just talk about a suite of techniques at a little lower level?  Thoughts?

The other thing I'm interested in is people's real-world feedback on using clustering to solve
their text-related problems.  For instance, what type of feature reduction did you do (stopword
removal, stemming, etc.)?  What algorithms worked for you?  What didn't work?  Any and all
insight is welcome, and I don't particularly care if it is Mahout-specific (for instance, part
of the chapter is about search result clustering using Carrot2, and so Mahout isn't applicable).

Thanks in advance and Happy New Year,
Grant