mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: SVD and Clustering
Date Tue, 06 Jul 2010 07:04:44 GMT
Hey Ted,

  What is the reasoning again that dropping the first eigenvector is
equivalent to forgetting about normalization / centering?  I have heard
this before, but the math doesn't immediately pop out at me... (isn't the
first right singular vector also effectively the "TextRank" vector, for
suitable input documents?)

  -jake

On Tue, Jul 6, 2010 at 8:27 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Related to normalization, the original LSA team claimed better results with
> tf·idf weighting.  I would tend to use log(1 + tf) · idf instead.  I think
> that term weighting of this sort is quite common.
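
(As a rough illustration of the two weightings: a minimal numpy sketch with
made-up counts, not Mahout code:)

    import numpy as np

    # toy term-frequency matrix: rows = documents, columns = terms
    tf = np.array([[3., 0., 1.],
                   [0., 2., 2.],
                   [1., 0., 0.]])

    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)          # document frequency of each term
    idf = np.log(n_docs / df)          # standard idf

    tfidf = tf * idf                   # plain tf.idf weighting
    log_tfidf = np.log1p(tf) * idf     # the log(1 + tf) . idf variant

The log(1 + tf) form damps the influence of terms that are very frequent
within a single document, which is the usual motivation for preferring it.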
>
> Document level normalization is a bit less common.  It is common practice,
> however, not to normalize documents but instead to drop the first
> eigenvector, on the theory that that is where the document norm winds up
> anyway.  I would imagine that normalizing documents to some degree would
> make the numerics of computing the SVD a bit better and save the extra work
> of computing and then throwing away that eigenvector.  The first
> eigenvector also takes the load of centering the documents.
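
(A toy numpy illustration of the first-singular-vector-as-centering effect;
illustrative, not a proof:)

    import numpy as np

    rng = np.random.default_rng(0)
    # uncentered "documents": noise plus a large shared offset
    X = rng.normal(size=(100, 20)) + 5.0

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    top = s[0] * np.outer(U[:, 0], Vt[0])   # rank-1 top component

    # dropping the top component lands close to explicit mean-centering
    X_centered = X - X.mean(axis=0)
    err = np.linalg.norm((X - top) - X_centered) / np.linalg.norm(X_centered)
    print(err)   # well below 1: the top direction soaks up the mean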
>
> I do know that I have forgotten to toss that first eigenvector on several
> occasions and been mystified for a time as to why my results weren't as
> good.
>
> On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > In my own experience, for things like graphs (including bipartite graphs
> > like ratings matrices) I normalize before *and* after, but text I don't
> > (unit) normalize before, though I do normalize after.
> >
> > The reasoning I use is that normalizing the rows of a graph has
> > a meaning in the context of the graph (you're doing the PageRank-like
> > thing of normalizing outflowing probability when looking at random
> > walks, for example, or for ratings matrices, you're saying that
> > everyone gets "one vote" to distribute amongst the things they've
> > rated [these apply when doing L_1 normalization, which isn't always
> > appropriate]), while I don't know if I buy a similar story for
> > pre-normalizing the rows of a text corpus.
> >
>
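
(A sketch of the normalize-before-and-after recipe for graph-like data that
Jake describes above: L_1 on the rows going in, unit length on the reduced
vectors coming out. Toy numpy with made-up ratings:)

    import numpy as np

    # toy user x item ratings matrix
    R = np.array([[5., 3., 0., 0.],
                  [4., 0., 0., 1.],
                  [0., 0., 5., 4.],
                  [0., 2., 4., 0.]])

    # L_1-normalize rows first: each user gets "one vote" to distribute
    R_l1 = R / R.sum(axis=1, keepdims=True)

    U, s, Vt = np.linalg.svd(R_l1, full_matrices=False)
    k = 2
    user_vecs = U[:, :k] * s[:k]        # rank-k user embeddings

    # then unit (L_2) normalize afterwards, so cosine similarity between
    # users is just a dot product
    user_vecs /= np.linalg.norm(user_vecs, axis=1, keepdims=True)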
