mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: Failure to run Clustering example
Date Wed, 13 May 2009 20:21:55 GMT
I think the "optimum" value for these parameters is pretty subjective. 
You may find estimation procedures that sometimes give you values you 
like, but canopy will put every point into a cluster, so the number of 
clusters is very sensitive to these values. I don't think normalizing 
your vectors will help, since you would need to normalize all vectors 
in your corpus by the same amount. You might then find t1 and t2 values 
always on 0..1, but the number of clusters will still be sensitive to 
your choices on this range and you will be dealing with decimal values.

It really depends upon how "similar" the documents in your corpus are 
and how fine a distinction you want to draw between documents before 
declaring them "different". What kind of distance measure are you using? 
A cosine distance measure will always give you distances on 0..1.
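To make the sensitivity concrete, here is a minimal sketch of canopy clustering with a cosine distance on 0..1. The T1/T2 values and toy vectors are illustrative assumptions, not taken from the thread, and Mahout's actual implementation differs in detail; the point is just that every point lands in at least one canopy, and the canopy count swings with the thresholds.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; lies in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def canopy(points, t1, t2):
    """Single-pass canopy assignment (sketch). Requires t1 > t2.
    Points within t1 of a center join its canopy; points within t2
    are removed from further consideration as future centers."""
    assert t1 > t2
    canopies = []              # list of (center, members)
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_available = []
        for p in remaining:
            d = cosine_distance(center, p)
            if d < t1:
                members.append(p)      # loosely bound: joins this canopy
            if d >= t2:
                still_available.append(p)  # may still seed another canopy
        remaining = still_available
        canopies.append((center, members))
    return canopies

# Same three points, three threshold choices, three different canopy counts:
pts = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(len(canopy(pts, 0.005, 0.001)))  # very tight: one canopy per point
print(len(canopy(pts, 0.05, 0.01)))    # moderate: the two near-parallel points merge
print(len(canopy(pts, 1.5, 1.01)))     # looser than the max distance: one canopy
```

With tight thresholds nearly every document becomes its own canopy, which matches the "LOT of canopies, potentially one per document" behavior described below for small T1/T2.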


Shashikant Kore wrote:
> Thank you, Jeff. Unfortunately, I don't have an option of using EC2.
> Yes, t1 and t2 values were low. Increasing these values helps. From
> my observations, the values of t1 and t2 need to be tuned depending on
> the data set. If the values of t1 and t2 for 100 documents are used for
> the set of 1000 documents, the runtime is affected.
> Is there any algorithm to find the "optimum" t1 and t2 values for a
> given data set?  Ideally, if all the distances are normalized (say in
> the range of 1 to 100), using the same distance thresholds across data
> sets of various sizes should work fine.  Is this statement correct?
> More questions as I dig deeper.
> --shashi
> On Tue, May 12, 2009 at 3:22 AM, Jeff Eastman
> <> wrote:
>> I don't see anything obviously canopy-related in the logs. Canopy serializes
>> the vectors but the storage representation should not be too inefficient.
>> If T1 and T2 are too small relative to your observed distance measures you
>> will get a LOT of canopies, potentially one per document. How many did you
>> get in your run? For 1000 vectors of 100 terms, however, it does seem that
>> something unusual is going on here. I've run canopy (on a 12-node cluster) with
>> millions of 30-element DenseVector input points and not seen these sorts of
>> numbers. It is possible you are thrashing your RAM. Have you thought about
>> getting an EC2 instance or two? I think we are currently ok with elastic MR
>> too but have not tried that yet.
>> I would not expect the reducer to start until all the mappers are done.
>> I'm back stateside Wednesday from Oz and will be able to take a look later
>> in the week. I also notice canopy still has the combiner problem we fixed in
>> kMeans and won't work if the combiner does not run. It's darned unfortunate
>> there isn't an option to require the combiner. More to think about...
>> Jeff
