mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: clusterdump lucene document ID
Date Mon, 11 Jun 2012 15:07:28 GMT
lucene.vector should be creating a NamedVector for each document, named with the value of
the field passed in as the idField, in your case _uid.  That field must be stored.  If the
field is null, it uses the internal Lucene id instead.  Those named vectors should be
preserved across all operations.  What does the output from your last step look like?
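
To make the fallback described above concrete, here is a minimal sketch of the naming rule only. It is not Mahout's actual code: the helper vector_name and the sample documents are hypothetical, and a field that is not stored is modeled as None.

```python
from typing import Optional


def vector_name(stored_id: Optional[str], lucene_doc_id: int) -> str:
    """Pick the label for a document's vector (hypothetical helper, not a Mahout API)."""
    # Prefer the stored idField value; fall back to the internal Lucene id
    # when the field has no stored value for this document.
    if stored_id is not None:
        return stored_id
    return str(lucene_doc_id)


# Hypothetical documents: the second one has no stored _uid value.
docs = [
    {"doc_id": 0, "_uid": "article-17"},
    {"doc_id": 1, "_uid": None},
]

names = [vector_name(d["_uid"], d["doc_id"]) for d in docs]
print(names)
```

If the names in your output look like small integers rather than your _uid values, that would be consistent with the fallback branch, i.e. the field was not stored in the index.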


On May 11, 2012, at 12:30 AM, Benjamin Busjaeger wrote:

> I am trying to cluster documents stored in a Lucene index using the command line tools. How can I obtain the original document IDs from the clustering output?
> 
> 
> Here is the sequence of commands I am using:
> 
> ./mahout lucene.vector --dir $index_path --output /tmp/mahout/vector --field content --dictOut /tmp/mahout/dict --idField _uid -md 2 -w TFIDF -x 70
> 
> ./mahout canopy -i /tmp/mahout/vector -o /tmp/mahout_canopy -dm org.apache.mahout.common.distance.CosineDistanceMeasure --t1 10 --t2 5
> 
> ./mahout kmeans -i /tmp/mahout/vector -c /tmp/mahout_canopy/clusters-0-final/part-r-00000 -o /tmp/mahout_kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 20 -x 20 -cd 0.1
> 
> ./mahout clusterdump -dt text -d /tmp/mahout/dict -s /tmp/mahout_kmeans/clusters-1-final/ -b 20 -n 20
> 
> 
> A similar question was asked on this thread [1], but I did not see a resolution. Thanks in advance for your help!
> 
> - Ben
> 
> 
> [1] http://mail-archives.apache.org/mod_mbox/mahout-user/201204.mbox/%3CCA+y9ocWgS2se7dOqQrsE3p+QE5GVXCt8XUTucFdZvGkJkPOaew@mail.gmail.com%3E


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com



