mahout-user mailing list archives

From Jake Mannix <>
Subject Re: clustering your data with dirichlet issue
Date Tue, 06 Apr 2010 17:27:03 GMT
Hey Jeff,

  Excuse my ignorance of the Dirichlet clustering process, but in reading
your email explaining this, I'm struck with the question: what is a user
supposed to do with this output at all right now? If the ClusterDumper can't
spit it out until that work is in, and it's in a format which is
Dirichlet-specific, what do we expect users to do with it once they've run
this?

  Without this final step, this seems very much like an unfinished feature,
to the point of being unusable.


On Tue, Apr 6, 2010 at 10:14 AM, Jeff Eastman <> wrote:

> Toby Doig wrote:
>> I've run the dirichlet command line and now have an output folder with
>> some state-0, state-1, ... state-5 folders, each containing part-00000 and
>> .part-00000.crc files. However, the ClusteringYourData wiki page's
>> Retrieving the Output section just says TODO. I don't know how to turn
>> those part files into something useful.
>> I successfully ran
>> the org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job test,
>> which outputted data as text (to the console at least), so I tried ripping
>> the printResults() methods from that class and putting them
>> in org.apache.mahout.clustering.dirichlet.DirichletJob, but to no avail.
>> Can someone help?
>> Also, the command line job asks for the prototypeSize (-s param). When I
>> converted my Lucene index to a vector file, the output said it created 11
>> vectors, but when I specified that value for prototypeSize the job failed,
>> saying it found 1793 vectors. Changing the value I specify to 1793 works,
>> but I now wonder why I need to specify it at all if the job can figure it
>> out. Could it not be optional?
> Hi Toby,
> Each of the state-i directories contains a sequence file of the model
> states at the end of the i-th iteration. Since Dirichlet does not have a
> convergence criterion, it will run for as many iterations as you select.
> Interpreting the results is also complicated by the fact that points are
> not assigned uniquely to a model - as in kmeans - or even with a
> probability - as in fuzzy kmeans. Each model does retain the number of
> points that it captured in that iteration - not the points themselves - so
> it is possible to back-fit the points to see which were the most likely to
> be captured, by using the model's pdf() function and taking the top n
> points. Of course, that won't scale, but check out TestL1ModelClustering
> in utils/ for some code that I used.
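[Editor's note: Jeff's back-fitting idea - score every point with a model's
pdf() and keep the top n - can be sketched roughly as below. The Model
interface here is a hypothetical stand-in for Mahout's Dirichlet model
classes, not the real API; see TestL1ModelClustering for his actual code.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of back-fitting points to a clustering model: score
// every point against the model's pdf() and keep the n highest-scoring
// points, i.e. the points the model was most likely to have captured.
public class BackFit {

  // Hypothetical stand-in for a model with a probability density function.
  interface Model {
    double pdf(double[] point);
  }

  // Return the n points the model was most likely to have captured.
  static List<double[]> topN(Model model, List<double[]> points, int n) {
    List<double[]> sorted = new ArrayList<>(points);
    // Sort descending by pdf value so the best-fitting points come first.
    sorted.sort(Comparator.comparingDouble((double[] p) -> model.pdf(p)).reversed());
    return sorted.subList(0, Math.min(n, sorted.size()));
  }

  public static void main(String[] args) {
    // Toy stand-in density: a 1-d Gaussian-shaped pdf centred at 0.
    Model gaussian = p -> Math.exp(-p[0] * p[0] / 2.0);
    List<double[]> points = List.of(
        new double[] {5.0}, new double[] {0.1}, new double[] {-2.0}, new double[] {0.5});
    // The two points nearest the centre score highest.
    for (double[] p : BackFit.topN(gaussian, points, 2)) {
      System.out.println(p[0]);
    }
  }
}
```

As Jeff says, sorting all points per model won't scale; on real data you
would stream points and keep a bounded top-n structure per model instead.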
> The ClusterDumper is not able to dump the Dirichlet clusters, though there
> is an issue to do this (MAHOUT-270) which is not yet completed. I'm working
> on it though, and you are welcome to make suggestions. Currently I'm trying
> to refactor the term priorities and other stuff in ClusterDumper to work
> with the Printable interface rather than relying upon ClusterBase.
> The prototype and prototypeSize arguments give you a way to specify the
> class and size of the Vectors which underlie the existing models. One could
> probably glean this information by inspecting the first data element
> presented to the algorithm at initialization time. There is at this time no
> connection between the Lucene index to Vector transformation in utils and
> the Dirichlet job in core/ and no obvious way to introduce one given the
> dependencies.
> Code suggestions and patches to improve this all are of course welcome,
> Jeff
