mahout-user mailing list archives

From DAN HELM <danielh...@verizon.net>
Subject Re: K-Means generates only one cluster
Date Fri, 19 Oct 2012 21:32:48 GMT
To look at vectors you can check out the data in the "clusteredPoints" folder generated by
k-means.  You can write the data out in text format via the seqdumper command (as shown in
step 5 here): http://amgadmadkour.blogspot.com/2012/07/kmeans-clustering-using-apache-mahout.html
 
The clusteredPoints output shows the cluster each document was assigned to, including the distance score, and I believe it also lists the document vectors (term:weight pairs).
 
You could also dump out the sparse vectors used as input to k-means via the seqdumper command
running against part files in your tfidf-vectors folder, e.g., 
 
mahout seqdumper -s ..../reuters-vectors/tfidf-vectors/part-r-00000 > vectors.txt
 
I believe there is also a vectordump command that can be used to dump out vectors in
text format.
 
It is always good to know what kind of input (vectors) you are feeding k-means, to make sure
the input is not causing the problem.
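Those NaN distances come from empty vectors: cosine distance divides by the vector norms, and an all-zero vector has norm 0. A quick Python sketch of the idea (illustrative only, not Mahout code; vectors are represented as term:weight dicts for simplicity):

```python
import math

# Cosine distance between two sparse vectors (term -> tfidf weight dicts).
# An empty vector has norm 0, so the division is undefined (NaN) -- which is
# why documents whose terms were all filtered out end up together with
# distance=NaN.
def cosine_distance(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")
    return 1.0 - dot / (norm_a * norm_b)

doc = {"oil": 0.7, "price": 0.4}
empty = {}  # all terms removed by the analyzer / tfidf filtering
print(cosine_distance(doc, empty))  # nan

# The cleanup pass we did before clustering: drop the empty vectors.
vectors = {"doc1": doc, "doc2": empty, "doc3": {"gold": 1.2}}
clean = {name: v for name, v in vectors.items() if v}
print(sorted(clean))  # ['doc1', 'doc3']
```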
 
Dan
 

________________________________
 From: syed kather <in.abdul@gmail.com>
To: user@mahout.apache.org; DAN HELM <danielhelm@verizon.net> 
Sent: Friday, October 19, 2012 8:16 AM
Subject: Re: K-Means generates only one cluster
  

 Thanks, Dan.
 Yes, I tried Tanimoto; that gives 6 clusters.


" It appeared for our data after our custom
lucene analyzer and the tfidf filtering was applied (in seq2sparse command) all
terms for many of our documents were removed.  These were documents that had minimal (and/or
garbage) text to begin "
  We also did the same, clearing the junk from the original documents, and we even
removed the stop words, but in our case it made no difference.

 How do we verify the vectors? Can you please suggest a way?

           Thanks and Regards,
        S SYED ABDUL KATHER 
               



On Fri, Oct 19, 2012 at 9:20 AM, DAN HELM <danielhelm@verizon.net> wrote:

>We previously did some k-means clustering runs on
>different sized collections and noticed that a large cluster was often created
>along with some smaller ones. Digging deeper, it turned out a lot of the
>document vectors (produced via the seq2sparse command) were null (empty). k-means apparently
>put these together in one large
>cluster. I also saw NaN for the computed distances
>for these vectors. And in the "clusteredPoints"
>file, it was clear many vectors were empty. It appeared that for our data, after our custom
>Lucene analyzer and the tfidf filtering were applied (in the seq2sparse command), all
>terms for many of our documents were removed. These were documents that had minimal
>(and/or garbage) text to begin
>with.
>So, maybe first verify that you are getting
>proper vectors as input to k-means. We ended up cleaning up the vectors
>before clustering them (tossing out the null ones). You can also experiment
>with different similarity measures in k-means (e.g., Tanimoto).
> Dan
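For reference, the Tanimoto measure suggested above is the extended Jaccard coefficient. A rough Python sketch of the formula only (not Mahout's TanimotoDistanceMeasure implementation; vectors again as term:weight dicts):

```python
# Tanimoto (extended Jaccard) distance: 1 - dot / (|a|^2 + |b|^2 - dot).
# Sketch of the formula, not Mahout's actual code.
def tanimoto_distance(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values())
    nb = sum(w * w for w in b.values())
    denom = na + nb - dot
    if denom == 0.0:  # both vectors empty
        return float("nan")
    return 1.0 - dot / denom

print(tanimoto_distance({"oil": 1.0}, {"oil": 1.0}))   # 0.0
print(tanimoto_distance({"oil": 1.0}, {"gold": 1.0}))  # 1.0
print(tanimoto_distance({"oil": 1.0}, {}))             # 1.0
```

Note that in this formula an empty vector gets a finite distance of 1.0 against any non-empty vector, rather than NaN, which may be one reason a Tanimoto run behaves differently from cosine on data containing empty vectors.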
>
>________________________________
> From: syed kather <in.abdul@gmail.com>
>To: user@mahout.apache.org
>Cc: Raja Ramesh <raja@pointcross.com>
>Sent: Thursday, October 18, 2012 11:03 PM
>Subject: K-Means generates only one cluster
>
>
>Team
>
>    Version Used : Mahout 0.6
>    Hadoop : 5 Nodes(1 Master + 4 Slaves)
>
>    We generated k-means clusters for 600000 documents. I ran
>clusterdump, which extracts the top terms from each cluster, and there I
>noticed that only one cluster was made even though we had specified the number
>of clusters as 10. I cross-checked the commands with some 1000 documents
>and applied clustering. I noticed that out of the 1000
>documents, Mahout was able to generate 10 clusters.
>
>Some observations I made on the 600000-document data:
>    In clusterdump I added "--pointDir <path>", because this option
>tells us exactly what the top terms are for each document. Here
>I noticed that some of the documents do not have a distance:
>1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt =]
>  0_6_1343_504071_6198107.txt ==> File Name
>1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt =]
>1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt =]
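Output like the above can be scanned programmatically to count how many documents ended up with NaN distances (i.e., empty vectors). A small Python sketch that parses that line format (a hypothetical helper, not a Mahout tool):

```python
import re

# Lines look like: "1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt"
PATTERN = re.compile(r"\[distance=([^\]]+)\]:\s*(\S+)")

def nan_documents(lines):
    """Return the file names whose computed distance is NaN."""
    bad = []
    for line in lines:
        m = PATTERN.search(line)
        if m and m.group(1) == "NaN":
            bad.append(m.group(2))
    return bad

sample = [
    "1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt",
    "1.0 : [distance=0.8231]: /0_6_1343_504071_6198108.txt",
]
print(nan_documents(sample))  # ['/0_6_1343_504071_6198107.txt']
```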
>
>Have a look at the commands I executed, one for the huge data set (600000
>documents) and one for the small data set (1000 documents):
>
>#Sequence file generation
>bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/
>-c UTF-8 -chunk 64   (600000 documents)
>bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ -c
>UTF-8 -chunk 64               (1000 documents)
>
>#Term Vector Creation.
>bin/mahout seq2sparse -i /hugeData/SequenceFiles/ -o
>/hugeData/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
>   (600000 doc)
>bin/mahout seq2sparse -i /blrdata/SequenceFiles/ -o
>/blrdata/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
>         (1000 documents)
>
>#Clustering
>bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ -c
>/hugeData/kmeans-clusters -o /hugeData/kmeans -dm
>org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
>--clustering                       (600000 documents)
>bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ -c
>/blrdata/kmeans-clusters -o /blrdata/kmeans -dm
>org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
>--clustering                        (1000 documents)
>
>#Cluster Dump
>bin/mahout clusterdump -s
>hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ -d
>hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 -dt
>sequencefile -b 100 -n 100                                  (600000
>documents)
>bin/mahout clusterdump -s
>hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ -d
>hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 -dt
>sequencefile -b 100 -n 10                                        (1000
>documents)
>
>I am using the MapReduce method for calculating k-means.
>
>I have no clue what is going wrong, so please help me find what I have
>missed. Please give me some suggestions on how to check what went wrong.
>
>
>Let me know if any further information is required.
>
>Thanks in advance
>S SYED ABDUL KATHER