mahout-user mailing list archives

From Shashikant Kore <shashik...@gmail.com>
Subject Re: Failure to run Clustering example
Date Thu, 14 May 2009 10:25:30 GMT
I get your point. Thank you.

I am using Euclidean distance.

--shashi

On Thu, May 14, 2009 at 1:51 AM, Jeff Eastman
<jdog@windwardsolutions.com> wrote:
> I think the "optimum" value for these parameters is pretty subjective. You
> may find some estimation procedures that will sometimes give you values you
> like, but canopy will put every point into a cluster so the number of
> clusters is very sensitive to these values. I don't think normalizing your
> vectors will help, since you need to normalize all vectors in your corpus by
> the same amount. You might then find t1 and t2 values always on 0..1 but the
> number of clusters will still be sensitive to your choices on this range and
> you will be dealing with decimal values.
>
> It really depends upon how "similar" the documents in your corpus are and
> how fine a distinction you want to draw between documents before declaring
> them "different". What kind of distance measure are you using? A cosine
> distance measure will always give you distances on 0..1.
>
> Jeff
>
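
To make the sensitivity to t1 and t2 concrete, here is a minimal single-pass canopy sketch in Python. It is only an illustration under my own assumptions, not Mahout's implementation: the function names (`canopy`, `euclidean`, `cosine_distance`) and the sample points are hypothetical, and it assumes t1 > t2. It also includes a cosine distance, which stays on 0..1 when all vector components are non-negative, as with term weights.

```python
import math

def euclidean(a, b):
    # Straight-line distance; unbounded, so t1/t2 depend on the data's scale.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; lies on 0..1 for vectors with non-negative components.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def canopy(points, t1, t2, distance):
    """Illustrative single-pass canopy sketch (assumes t1 > t2)."""
    canopies = []  # list of (center, members)
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(center, p)
            if d < t1:
                members.append(p)      # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True  # close enough that p starts no new canopy
        if not strongly_bound:
            canopies.append((p, [p]))  # p becomes the center of a new canopy
    return canopies

# Hypothetical sample data: two tight pairs and one outlier.
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.5, 5.1), (9.0, 1.0)]

for t1, t2 in [(12.0, 10.0), (3.0, 1.0), (0.4, 0.2)]:
    print(t1, t2, len(canopy(points, t1, t2, euclidean)))
```

With these five points the three threshold pairs give 1, 3 and 5 canopies respectively. Every point lands in a canopy each time, so the thresholds alone decide how many clusters come out.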
