mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: tf-idf + svd + cosine similarity
Date Wed, 15 Jun 2011 17:44:13 GMT
On Wed, Jun 15, 2011 at 10:06 AM, Stefan Wienert <stefan@wienert.cc> wrote:

> Hmm. It seems I have plenty of negative results (nearly half of the
> similarities). If I add +0.3, the most negative results end up
> near 0. This is not optimal...
> I could also rescale the results to [0..1].
>

Looking for *dissimilar* results seems odd.  What are you trying to do?

What people normally do is look for clusters of similar documents, or
just the top-N most similar documents to each document.  In both of these
cases, you don't care about the documents whose similarity to anyone is
zero, or less than zero.
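
A minimal sketch of the top-N approach, in plain Python rather than
Mahout, assuming the documents are already available as dense vectors
in the projected space. The document ids, the vectors and the helper
names (cosine, top_n_similar) are made up for illustration. Any
similarity that is zero or negative is simply dropped, so a document
can end up with fewer than N neighbours, or none at all.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(docs, n):
    """Map each document id to at most n (other_id, similarity) pairs,
    best first, keeping only strictly positive similarities."""
    result = {}
    for doc_id, vec in docs.items():
        scored = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in docs.items()
            if other_id != doc_id
        ]
        # Zero and negative similarities are treated as "not similar" and dropped.
        positive = [(other_id, sim) for other_id, sim in scored if sim > 0.0]
        positive.sort(key=lambda pair: pair[1], reverse=True)
        result[doc_id] = positive[:n]
    return result


# Hypothetical 2-d vectors standing in for documents after SVD projection.
docs = {
    "a": [1.0, 0.2],
    "b": [0.9, 0.3],
    "c": [-1.0, 0.1],
    "d": [0.2, 1.0],
}

neighbours = top_n_similar(docs, 2)
for doc_id, pairs in neighbours.items():
    print(doc_id, pairs)
```

With these made-up vectors, "c" points away from every other document,
so its neighbour list comes back empty instead of being padded with
negative matches.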

  -jake


> Any other suggestions or comments?
>
> Cheers
> Stefan
>
> 2011/6/15 Jake Mannix <jake.mannix@gmail.com>:
> > While your original vectors never had similarity less than zero, after
> > projection onto the SVD space, you may "project away" similarities
> > between two vectors, and they are now negatively correlated in this
> > space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
> > space spanned by (1,-1,0) - they go from having cosine similarity +1/2
> > to similarity -1).
> >
> > I always interpret all similarities <= 0 as "maximally dissimilar",
> > even if technically -1 is where this is exactly true.
> >
> >  -jake
> >
> > On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <stefan@wienert.cc> wrote:
> >
> >> Ignoring is no option... so I have to interpret these values.
> >> Can one say that documents with similarity = -1 are the least similar
> >> documents? I don't think this is right.
> >> Any other assumptions?
> >>
> >> 2011/6/15 Fernando Fernández <fernando.fernandez.gonzalez@gmail.com>:
> >> > One question that I think has not been answered yet is that of the
> >> > negative similarities. In the literature you can find that
> >> > similarity = -1 means that "documents talk about opposite topics",
> >> > but I think this is quite an abstract idea... I just ignore them; when
> >> > I'm trying to find the top-k similar documents these surely won't be
> >> > useful. I read recently that this has to do with the assumptions in
> >> > SVD, which is designed for normal distributions (this implies the
> >> > possibility of negative values). There are other techniques
> >> > (non-negative matrix factorization) that try to solve this. I don't
> >> > know if there's something in Mahout about this.
> >> >
> >> > Best,
> >> >
> >> > Fernando.
> >> >
> >> > 2011/6/15 Ted Dunning <ted.dunning@gmail.com>
> >> >
> >> >> The normal terminology is to name U and V in SVD as "singular vectors"
> >> >> as opposed to eigenvectors.  The term eigenvectors is normally reserved
> >> >> for the symmetric case of U S U'  (more generally, the Hermitian case,
> >> >> but we only support real values).
> >> >>
> >> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >>
> >> >> > I beg to differ... U and V are left and right eigenvectors, and
> >> >> > the singular values are denoted Sigma (they are the square roots of
> >> >> > the eigenvalues of AA', as you correctly pointed out).
> >> >> >
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Stefan Wienert
> >>
> >> http://www.wienert.cc
> >> stefan@wienert.cc
> >>
> >> Telefon: +495251-2026838
> >> Mobil: +49176-40170270
> >>
> >
>
>

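The projection example quoted above, (1,0,1) and (0,1,1) projected onto
the line spanned by (1,-1,0), can be checked numerically. This is a
small sketch in plain Python; the helper names (dot, cosine,
project_onto_line) are made up for illustration and have nothing to do
with Mahout's API.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project_onto_line(x, direction):
    """Coordinate of x along the unit vector in `direction`,
    returned as a 1-d vector (the projection onto a 1-d subspace)."""
    norm = math.sqrt(dot(direction, direction))
    return [dot(x, direction) / norm]


x = [1.0, 0.0, 1.0]
y = [0.0, 1.0, 1.0]
basis = [1.0, -1.0, 0.0]

# Cosine similarity in the original 3-d space.
before = cosine(x, y)

# Cosine similarity after projecting both vectors onto the 1-d space.
after = cosine(project_onto_line(x, basis), project_onto_line(y, basis))

print("before projection:", before)
print("after projection: ", after)
```

Up to floating-point rounding, the similarity is 1/2 before the
projection and -1 after it: the shared third coordinate, which made the
two vectors similar, is orthogonal to (1,-1,0) and is projected away,
leaving only the coordinates in which they differ.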