mahout-user mailing list archives

From Shashikant Kore
Subject Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Date Wed, 17 Jun 2009 14:34:08 GMT
On Wed, Jun 17, 2009 at 6:35 PM, Benson Margulies wrote:
> As I read the paper, the idea here is to get a rough partitioning that is
> used to optimize various downstream algorithms, not to tune for a precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
> Thus the extended discussion of how to start up and run various other
> algorithms, (e.g. k-means).

That's right. But here is my experience. I ran Canopy and then
K-Means on 50k doc vectors. (That, by the way, is a fraction of the
actual dataset.) I used the code in the patch for MAHOUT-121, which
uses primitives for sparse vectors.

After some experimentation, for a t2 value of 0.9, I got only 1
cluster. When I changed it to 0.85, it generated 3000+ clusters (or
canopies). As the number of canopies increases, the code starts
crawling. And after some time, even 2G memory is not sufficient for

Canopy is one of the simplest clustering algorithms, and I had trouble
getting it to work. Maybe it's my data set. I simply didn't have the
patience to find all the values of t1 and t2, which are anyway
going to change when the input changes. So, for now, I have just put a
cap on the number of canopies generated. Not elegant, but the results
don't seem bad at all.
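
To make the idea of a cap concrete, here is a minimal Python sketch of
canopy generation with an upper bound on the number of canopies. It is
not the Mahout code or the MAHOUT-121 patch. The names
(cosine_distance, build_canopies, max_canopies), the dict-based sparse
vectors, the choice of cosine distance, and the decision to simply
return the unassigned points once the cap is hit are all assumptions
made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between sparse vectors stored as {index: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def build_canopies(points, t1, t2, max_canopies=None, distance=cosine_distance):
    """Return (canopies, leftover).

    canopies is a list of (center_index, member_indices) pairs.
    leftover holds indices still unassigned when the cap stopped the loop.
    t1 is the loose threshold, t2 the tight one, so t1 must exceed t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    remaining = list(range(len(points)))
    while remaining:
        if max_canopies is not None and len(canopies) >= max_canopies:
            break  # hypothetical cap: stop creating new canopies
        center_idx = remaining[0]
        members = [center_idx]
        still_remaining = []
        # The center itself is always dropped, so the loop always terminates.
        for idx in remaining[1:]:
            d = distance(points[center_idx], points[idx])
            if d < t1:
                members.append(idx)  # loosely bound to this canopy
            if d >= t2:
                still_remaining.append(idx)  # may still seed or join another canopy
        canopies.append((center_idx, members))
        remaining = still_remaining
    return canopies, remaining


docs = [{0: 1.0}, {0: 1.0, 1: 0.1}, {1: 1.0}, {2: 1.0}]
canopies, leftover = build_canopies(docs, t1=0.5, t2=0.2, max_canopies=2)
```

With a distance measure, raising t2 removes more points per canopy and
so produces fewer canopies, which is why a small change in t2 can swing
the count so sharply. The leftover points could be handed to the
nearest existing canopy or left for K-Means to place.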

OK. Now, let's not focus on my ignorance. I got my hands dirty
with Machine Learning, Mahout and Hadoop barely a few days ago.


