mahout-user mailing list archives

From Christoph Brücke <christoph.brue...@campus.tu-berlin.de>
Subject Re: Canopy Generation
Date Tue, 28 Jun 2011 09:03:07 GMT
Hi Mark,

the T1 threshold should be strictly larger than the T2 one (T1 > T2). And yes, the cluster
dumper utility should give you more than one cluster if more than one is present. The output looks like:
CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, ...] r=[3.463, 3.351, 3.452, 3.438, 3.371, ...] }
CL-1 { n=... c=[... , ...] r=[3... , ...] }

Here CL-0 is the cluster id, n is the number of vectors within the cluster, c is the centroid,
and r is the radius.
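
If you want to post-process that output, a line in this format could be picked apart roughly as follows. This is only a minimal sketch: it assumes the dump lines look exactly like the example above (dense, comma-separated values), and the names `parse_cluster_line` and `parse_vector` are made up for illustration and are not part of Mahout.

```python
import re

# Assumed line shape, taken from the example above:
#   <id> { n=<count> c=[<centroid values>] r=[<radius values>] }
LINE_RE = re.compile(
    r"^(?P<id>\S+)\s*\{\s*n=(?P<n>\d+)\s+"
    r"c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\s*\}$"
)


def parse_vector(text):
    """Parse comma-separated floats, skipping elided entries such as '...'."""
    values = []
    for token in text.split(","):
        token = token.strip()
        if not token or token == "...":
            continue
        values.append(float(token))
    return values


def parse_cluster_line(line):
    """Return the id, vector count, centroid and radius of one dump line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised cluster line: %r" % line)
    return {
        "id": match.group("id"),
        "n": int(match.group("n")),
        "centroid": parse_vector(match.group("c")),
        "radius": parse_vector(match.group("r")),
    }


line = ("CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, ...] "
        "r=[3.463, 3.351, 3.452, 3.438, 3.371, ...] }")
cluster = parse_cluster_line(line)
print(cluster["id"], cluster["n"], cluster["centroid"][:2])
```

Counting the parsed lines then gives the number of clusters in the dump.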

On 27.06.2011 at 16:33, Mark wrote:

> My input data is a bunch of product item titles, so I first created sparse vectors with seq2sparse:
> 
> bin/mahout seq2sparse -i sequence-input -o sparse-output -ow output 5 -md 2 -wt TFIDF -n 2 -ml 50 -nr 2 -ng 4 -seq -nv -x 80
> 
> 
> I then generated canopies:
> 
> mahout canopy -i sequence-input/tfidf-vectors -o canopies -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -ow -xm sequential -t1 100 -t2 200
> 
> 
> I also tried 1 and 2 for t1 and t2, respectively.
> 
> I guess I'll have to play with some other sample data and configurations to see the results
> I get. If everything goes well, I should see multiple key/value pairs when inspecting the canopies
> via ClusterDump, correct?
> 
> Something like this?
> 
> Key: C-0: Value: C-0: ...
> Key: C-1: Value: C-1: ...
> Key: C-2: Value: C-2: ...
> 
> 
> Thanks
> 
> 
> On 6/27/11 2:12 AM, Christoph Brücke wrote:
>> Hi,
>> 
>> Usually, depending on the input data, there should be more than just one cluster. You
>> may use the cluster dumper utility to output the cluster data (https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper).
>> 
>> It seems that the t1 and t2 thresholds for your canopies were chosen too large, so that
>> all data points are assigned to just one canopy. Could you describe your input data (number
>> of dimensions, range, distribution, ...) and give the parameters you used for the clustering?
>> 
>> Regards,
>> Christoph
>> 
>> On 27.06.2011 at 00:40, Mark wrote:
>> 
>>> Is there an easy way to know how many canopies were generated after running
>>> the canopy generation tool?
>>> 
>>> I tried viewing the file clusters-0/part-r-00000 via seqdumper but it always returns:
>>> 
>>> Key: C-0: Value: C-0: {437:0.005630003188145648,478:0.006034746778989781,591:0.020761514762446885...
>>> Count: 1
>>> 
>>> Should there be multiple key value pairs or just this one?
>>> 
>>> Thanks
>>> 
>>> 
>> 
> 
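To make the role of the two thresholds concrete, here is a small sketch of the textbook canopy procedure. It is not Mahout's implementation: the function name `build_canopies`, the toy points, and the explicit T1 > T2 check are assumptions made only for this illustration. Points closer than T1 to a canopy centre join that canopy, and points closer than T2 are removed from the candidate list, so overly large thresholds swallow every point into the first canopy.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_canopies(points, t1, t2, distance=euclidean):
    """Textbook canopy generation; returns a list of (center, members)."""
    if t1 <= t2:
        # Assumption of this sketch: reject thresholds that violate T1 > T2.
        raise ValueError("T1 must be strictly larger than T2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                # Loose threshold: the point belongs to this canopy.
                members.append(point)
            if d >= t2:
                # Outside the tight threshold: may still start a new canopy.
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(len(build_canopies(points, t1=3.0, t2=1.5)))      # prints 2
print(len(build_canopies(points, t1=200.0, t2=100.0)))  # prints 1
```

With thresholds far above the typical distance between points, the second call puts everything into a single canopy, which matches the single C-0 entry seen in the seqdumper output.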


