mahout-user mailing list archives

From Shashikant Kore <>
Subject Re: Failure to run Clustering example
Date Tue, 12 May 2009 10:22:15 GMT
Thank you, Jeff. Unfortunately, I don't have an option of using EC2.

Yes, the t1 and t2 values were low. Increasing these values helps. From
my observations, the values of t1 and t2 need to be tuned depending on
the data set. If the values of t1 and t2 chosen for 100 documents are used
for a set of 1000 documents, the runtime suffers.

Is there any algorithm to find the "optimum" t1 and t2 values for a
given data set?  Ideally, if all the distances were normalized (say, into
the range of 1 to 100), using the same distance thresholds across data
sets of various sizes should work fine.  Is this statement correct?

More questions as I dig deeper.


On Tue, May 12, 2009 at 3:22 AM, Jeff Eastman
<> wrote:
> I don't see anything obviously canopy-related in the logs. Canopy serializes
> the vectors but the storage representation should not be too inefficient.
> If T1 and T2 are too small relative to your observed distance measures you
> will get a LOT of canopies, potentially one per document. How many did you
> get in your run? For 1000 vectors of 100 terms, however, it does seem that
> something unusual is going on here. I've run canopy (on a 12-node cluster) with
> millions of 30-element DenseVector input points and not seen these sorts of
> numbers. It is possible you are thrashing your RAM. Have you thought about
> getting an EC2 instance or two? I think we are currently ok with elastic MR
> too but have not tried that yet.
> I would not expect the reducer to start until all the mappers are done.
> I'm back stateside Wednesday from Oz and will be able to take a look later
> in the week. I also notice canopy still has the combiner problem we fixed in
> kMeans and won't work if the combiner does not run. It's darned unfortunate
> there isn't an option to require the combiner. More to think about...
> Jeff
