mahout-user mailing list archives

From Benson Margulies <>
Subject Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Date Wed, 17 Jun 2009 13:05:55 GMT
All I know is what I learned from reading the paper. However, I continue to
think, from reading the paper, that you may be trying to make Canopy do
something it was not intended to do.

As I read the paper, the idea here is to get a rough partitioning that is
used to optimize various downstream algorithms, not to tune for a precise
partitioning. The number of canopies doesn't need, as I read it, to be
particularly close to the number of eventual partitions to be useful.
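
For readers who have not seen the algorithm, here is a minimal Python sketch of the canopy pass as I understand it from the paper. It is not Mahout's implementation. The name `canopy`, the use of plain Euclidean distance, and the choice to always take the first remaining point as the next center are all assumptions made for illustration; the paper picks the point arbitrarily and uses a cheap approximate distance.

```python
import math


def canopy(points, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold, t2 the tight one, with t1 > t2 > 0.
    Returns a list of (center, members) pairs.
    """
    if not (t1 > t2 > 0):
        raise ValueError("need t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the next center.
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # so a point can belong to more than one canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.0)]
    for center, members in canopy(data, t1=2.0, t2=1.0):
        print(center, members)
```

The output is a rough, overlapping grouping. A downstream algorithm such as k-means can then limit its exact distance computations to points that share a canopy, which is the speed-up the paper is after.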

Hence the paper's extended discussion of how to start up and run various
other algorithms (e.g., k-means).

Now, still, you need to get some useful number of partitions. The paper has
a classic toss-off line, 'we used cross-validation,' without any details
about exactly what the authors did. Presumably, that means that the authors
ran many possible values and hand-examined the results. The paper reports no
general results about how sensitive the T values are to particular input
data sets. A pessimist would fear that, for any new input, you're going to
need to go through a lengthy process to find good values for T1 and T2.
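
To make that search concrete, here is a hedged Python sketch of the crudest version: sweep candidate T2 values and count the canopies each one yields. The helper names `count_canopies` and `sweep_t2` are hypothetical, the distance is plain Euclidean, and the data is a toy set of evenly spaced points rather than anything real. In this simplified formulation the number of canopies depends only on T2 and the point order, because T2 alone decides which points are removed from the candidate list; T1 only controls how much the canopies overlap.

```python
import math


def count_canopies(points, t2):
    """Count the canopies a single pass creates for a given tight threshold."""
    if t2 <= 0:
        raise ValueError("t2 must be positive")
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining[0]  # assumption: deterministic choice of center
        count += 1
        # Drop every point tightly bound to this center, including the center.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return count


def sweep_t2(points, candidates):
    """Return (t2, canopy count) pairs for each candidate threshold."""
    return [(t2, count_canopies(points, t2)) for t2 in candidates]


if __name__ == "__main__":
    # Toy data: ten points spaced exactly 1.0 apart on a line.
    line = [(float(i),) for i in range(10)]
    for t2, n in sweep_t2(line, [0.95, 1.05, 100.0]):
        print(t2, n)
```

On this toy data, moving T2 from just below the point spacing to just above it halves the canopy count, which is the same kind of narrow window reported in the quoted message below. A sweep like this only shows where the count changes; it does not say which value is good for the downstream algorithm.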

This leads me to wonder, ignorantly, why this project is so focused on
Canopy. The paper describes it as a tool for speeding up various other
things. Since you're hadooping all those other things, how much does it

Anyway, I expect that my ignorance is on comprehensive display here.

On Wed, Jun 17, 2009 at 7:16 AM, Grant Ingersoll <> wrote:

> Shashikant asked this over on mahout-dev, but I thought I would move it to
> user so that others can benefit from the discussion.
> On Jun 17, 2009, at 1:12 AM, Shashikant Kore (JIRA) wrote:
>> Shashikant Kore commented on MAHOUT-121:
>> ----------------------------------------
>> [OT] Also, was wondering how you came up with the values of t1 and t2 as
>> 1.3 & 1.0. This is voodoo for me. The dataset I am working with has a
>> window of 0.05 in which the result changes from 0 canopies to 3,000
>> canopies.
> I just picked some numbers based on what you did!  It is voodoo to me too.
>  I have not done much clustering, so I'm learning a lot here.  As for
> MAHOUT-121, I just wanted something to run.
