mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Dirichlet Process Clustering not working
Date Tue, 18 Oct 2011 16:23:47 GMT
Check out TestClusterDumper.testDirichlet2&3 for an example of text clustering using DPC.
It produces reasonable looking clusters when compared with k-means and the other algorithms,
but on a small vocabulary. Also check out DisplayDirichlet, which does a great job of clustering
some random 2-d data. 

I'd suggest trying the default 1.0 alpha as is done in the cluster dumper tests. Also, the
default model is GaussianCluster and it may not perform well with a large feature space. Check
the pdf() function which uses the product of the component pdfs to produce the composite value
for each cluster. This may not be optimal for really large term vectors. How many elements
are in your term vectors? You may need to create your own model and model distribution to
make DPC perform on your data.
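
To illustrate the concern about multiplying component pdfs, here is a minimal Python sketch. It is not Mahout's GaussianCluster code; it only assumes a model whose composite density is the product of independent one-dimensional normal densities, and the function names are made up for the example. With a thousand dimensions the plain product underflows to zero, while the same quantity summed in log space stays finite.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def product_pdf(point, means, stds):
    """Composite density as a plain product of per-dimension densities."""
    p = 1.0
    for x, m, s in zip(point, means, stds):
        p *= gaussian_pdf(x, m, s)
    return p


def log_pdf(point, means, stds):
    """The same quantity computed as a sum of per-dimension log densities."""
    total = 0.0
    for x, m, s in zip(point, means, stds):
        z = (x - m) / s
        total += -0.5 * z * z - math.log(s * math.sqrt(2.0 * math.pi))
    return total


def demo(dims):
    """Evaluate both forms at the cluster center with unit variance (assumed)."""
    point = [0.0] * dims
    means = [0.0] * dims
    stds = [1.0] * dims
    return product_pdf(point, means, stds), log_pdf(point, means, stds)


small_product, small_log = demo(10)
large_product, large_log = demo(1000)
print(small_product, small_log)
print(large_product, large_log)
```

Once every cluster's product collapses to zero (or blows up when the standard deviations are very small), the relative likelihoods that drive cluster assignment carry no information, which is one plausible reason a custom model may be needed for large term vectors.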

Jeff

-----Original Message-----
From: edward choi [mailto:mp2893@gmail.com] 
Sent: Tuesday, October 18, 2011 7:06 AM
To: user@mahout.apache.org
Subject: Dirichlet Process Clustering not working

Hi,

This is my first time using Mahout, though I have been working with Hadoop
and HBase for over a year.

I collected several hundred thousand news articles from RSS feeds, and I
wanted to run Dirichlet process clustering (DPC) on them.
I followed the Mahout wiki: I made sequence files from the plain documents,
converted them into vectors, ran DPC, and finally ran clusterdump.
My DPC settings were: 20 clusters, 10 iterations, alpha 2.0, clustering true,
emitMostLikely false. No modelDist, modelPrototype, or distanceMeasure was
specified.
The number of documents was 5896. (I preprocessed the docs so that they would
only contain verbs and nouns.)
The result was not what I had expected.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0
.......................
    Top Terms:
        comment                                 =>0.015425061539023016
        2011                                    =>0.011413068888273332
        reserve                                 =>0.011253999429472274
        rights                                  => 0.01115527360420605
        use                                     =>0.010942002711960384
        rights reserve                          =>0.010882667414113879
        copyright                               =>0.010399572042096333
        publish                                 =>0.009924242339732702
        time                                    => 0.00988611270657134
        material                                =>0.009849842593611612

C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....
    Top Terms:.......

C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is what the clusterdump looks like. To my understanding, this means
that all the documents were assigned to a single cluster, namely C-0.
I changed the DPC settings around. I also changed the vector creation
process a bit, but the result was always the same.
I was at a loss, so I tried k-means with the exact same documents and
vectors, and it worked! I don't know how I am supposed to interpret
this.
I searched Google but found no definite solution, so I guess everybody
else is doing fine with DPC.
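
One way to confirm that reading of the dump is to pull the n= count out of each cluster header line. The following is a minimal Python sketch; it assumes header lines shaped like the "C-0: GC:0{n=5896 c=[..." lines shown above, and the function name and sample list are made up for the example.

```python
import re

# Matches header lines such as "C-0: GC:0{n=5896 c=[..." from the dump above.
HEADER = re.compile(r"^C-(\d+): GC:\d+\{n=(\d+)")


def cluster_sizes(lines):
    """Map each cluster id to the number of points reported in its header."""
    sizes = {}
    for line in lines:
        match = HEADER.match(line.strip())
        if match:
            sizes[int(match.group(1))] = int(match.group(2))
    return sizes


# Header lines copied from the dump above, truncated exactly as posted.
sample = [
    "C-0: GC:0{n=5896 c=[0:0.001, 0.07:0.000, 0.08:0.000, 0",
    "    Top Terms:",
    "C-1: GC:1{n=0 c=[0:-0.239, 0.07:0.775, 0.08:-0.767,.....",
    "C-10: GC:10{n=0 c=[0:-1.116, 0.07:-0............",
]

sizes = cluster_sizes(sample)
nonempty = [cluster_id for cluster_id, n in sizes.items() if n > 0]
print(sizes)     # {0: 5896, 1: 0, 10: 0}
print(nonempty)  # [0]
```

For the lines shown here, only C-0 has a non-zero count, which matches the interpretation that every document landed in one cluster.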

Could someone please tell me what I am doing wrong? (Oh, and I am using
Mahout in standalone mode.)

Regards,
Ed
