mahout-user mailing list archives

From Fernando Fernández <fernando.fernandez.gonza...@gmail.com>
Subject Re: tf-idf + svd + cosine similarity
Date Wed, 15 Jun 2011 10:57:00 GMT
I think LanczosSolver produces negative values as well; I don't know
about SSVD.

I guess that if the similarity is strongly negative, you can say that the
documents talk about things that almost never appear together in the same
text (if term A appears, then term B won't). But I think this is almost
impossible in practice (at least in the most extreme case, similarity = -1),
as there are always common expressions that appear in many documents. I
think that's why avg(similarity) is always above 0 in your case.
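
To make this concrete, here is a small sketch in plain Python. The vectors
are made up for illustration (they are not from your data, and this is not
Mahout code): doc_a and doc_b stand in for raw tf-idf rows, and red_a, red_b,
red_c stand in for document coordinates after a decomposition that allows
mixed signs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical raw tf-idf vectors: every weight is >= 0.
# Columns: term A, term B, a common expression shared by both documents.
doc_a = [2.0, 0.0, 0.5]
doc_b = [0.0, 3.0, 0.5]
raw_sim = cosine(doc_a, doc_b)      # small, but positive because of the shared term

# Hypothetical coordinates in a reduced space, where signs can be mixed.
red_a = [1.0, -2.0]
red_b = [-1.0, 2.0]                 # exactly opposed to red_a
red_c = [1.0, 2.0]                  # agrees with red_a on one axis only
opposed_sim = cosine(red_a, red_b)  # the extreme case, about -1
partial_sim = cosine(red_a, red_c)  # negative, but not extreme

print(raw_sim, opposed_sim, partial_sim)
```

With non-negative weights the dot product cannot be negative, so the shared
expression only ever pushes the similarity up. A negative value needs
coordinates of opposite sign, and -1 needs every coordinate to be opposed.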

2011/6/15 Sean Owen <srowen@gmail.com>

> The features all take on non-negative values here, right?
> Then the cosine can't be negative.
>
> In another context, where features could be negative, cosine could
> indeed be negative. -1 means most dissimilar of all -- the feature
> vectors are exactly opposed.
>
> On Wed, Jun 15, 2011 at 10:10 AM, Stefan Wienert <stefan@wienert.cc>
> wrote:
> > Ignoring is no option... so I have to interpret these values.
> > Can one say that documents with similarity = -1 are the less similar
> > documents? I don't think this is right.
> > Any other assumptions?
>
