mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: How To get the Documents from generated Cluster
Date Wed, 14 Jul 2010 16:18:14 GMT
If you ran the kmeans clustering algorithm (the default), then you need
to add the -cl option to obtain the clustered documents in the
output/clusteredPoints directory. Run bin/mahout kmeans to see the
command line help:

  [--input <input> --clusters <clusters> --output <output>
   --distanceMeasure <distanceMeasure> --convergenceDelta <convergenceDelta>
   --maxIter <maxIter> --maxRed <maxRed> --k <k> --overwrite --help --clustering]
Options
  --input (-i) input                          Path to job input directory.
  --clusters (-c) clusters                    The input centroids, as Vectors.
                                              Must be a SequenceFile of
                                              Writable, Cluster/Canopy.  If k
                                              is also specified, then a random
                                              set of vectors will be selected
                                              and written out to this path
                                              first
  --output (-o) output                        The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure     The classname of the
                                              DistanceMeasure. Default is
                                              SquaredEuclidean
  --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
                                              Default is 0.5
  --maxIter (-x) maxIter                      The maximum number of iterations.
  --maxRed (-r) maxRed                        The number of reduce tasks.
                                              Defaults to 2
  --k (-k) k                                  The k in k-Means.  If specified,
                                              then a random selection of k
                                              Vectors will be chosen as the
                                              Centroid and written to the
                                              clusters input path.
  --overwrite (-ow)                           If present, overwrite the output
                                              directory before running job
  --help (-h)                                 Print out help
  --clustering (-cl)                          If present, run clustering after
                                              the iterations have taken place
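
As a rough sketch of how these options might fit together, a run that also
writes clusteredPoints could look like the script below. It assumes the
launcher is invoked as "bin/mahout kmeans"; the three paths and the values
given to -k and -x are made-up placeholders, not taken from this thread. The
script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: every path and numeric value here is a hypothetical placeholder.
# The flags themselves (-i, -c, -o, -k, -x, -ow, -cl) come from the help text.

INPUT="reuters-vectors"      # hypothetical: directory holding the input vectors
CLUSTERS="reuters-initial"   # hypothetical: where the k random seed centroids go
OUTPUT="reuters-kmeans"      # hypothetical: job output directory
K=20                         # hypothetical number of clusters
MAX_ITER=10                  # hypothetical cap on iterations

# -cl runs the clustering step after the iterations finish; the clustered
# documents should then appear under $OUTPUT/clusteredPoints.
cmd="bin/mahout kmeans -i $INPUT -c $CLUSTERS -o $OUTPUT -k $K -x $MAX_ITER -ow -cl"

# Dry run by default: print the command. Set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  $cmd
else
  echo "$cmd"
fi
```

Leaving out -cl would still produce the cluster centroids, but no
per-document assignments, which is probably why only the top terms showed up.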

If you run LDA on your documents it has its own output mechanism, ldatopics.
Jeff

On 7/14/10 7:41 AM, Grant Ingersoll wrote:
> How did you run the command?  Also, how did you run your clustering?  I believe
> (I'm not looking at output at the moment), but I believe there is a points
> directory created and it contains the results.  I believe it is all captured on
> the Wiki under the algorithms->clustering section.
>
> -Grant
>
> On Jul 12, 2010, at 6:41 AM, Amit Kolhe wrote:
>
>> Hi All,
>>
>> I tried the Reuters clustering example using bin/build-reuters.sh.
>>
>> It ran successfully. My question now is how I can get the document list
>> from the generated clusters.
>>
>> The cluster dump step shows output like the following. Please help me
>> understand it.
>>
>> :C-14565: [0:0.021, 00:0.071, 00.03:0.022, 00.13:0.024, 00.36:0.023,
>> 00.45:0.023, 00.49:0.047, 00.80:
>>         Top Terms:
>>                 vs         => 7.523020040478454
>>                 loss       => 5.752297657262768
>>                 oper       => 5.442479698123499
>>                 net        => 5.263766102586645
>>                 cts        => 5.031350276932608
>>                 shr        => 4.535172744121599
>>                 mln        => 3.9992122872350198
>>                 qtr        => 3.726620754607078
>>                 profit     => 3.653005141154945
>>                 revs       => 3.6289278566086622
>>                 dlrs       => 3.545725744977706
>>                 note       => 3.507477366353763
>>                 excludes   => 2.5428448984544882
>>                 includes   => 2.4644159538019212
>>                 avg        => 2.433302790452011
>>                 shrs       => 2.4035651785900973
>>                 gain       => 2.3975824745235874
>>                 4th        => 2.3647150188609394
>>                 mths       => 2.171140240782154
>>                 year       => 2.14992541570207
>>
>> My understanding is that clustering generates small clusters of similar
>> documents or stories, like Google News. If so, how do I display the actual
>> documents in each cluster?
>>
>> Thanks and Regards,
>>
>> Amit Kolhe

