Hey Ted,
What is the reasoning again that dropping the first eigenvector is equivalent
to forgetting about normalization / centering? I have heard this before, but
don't know if the math pops out at me... (isn't the first right singular vector
also effectively the "TextRank" vector, for suitable input documents?)
jake
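
One way to poke at this question numerically is the sketch below. It is only
an illustration under assumptions that are not in this thread: a synthetic
Poisson count matrix (`counts`, built from made-up `doc_lengths` and
`term_rates`) stands in for a real corpus, and NumPy's `np.linalg.svd` does the
decomposition. It checks two things: whether the first right singular vector of
an uncentered, nonnegative matrix points along the mean document, and whether
each document's coordinate on that first component tracks the document's norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: 200 documents x 50 terms.
# Each document has its own length factor, each term its own base rate.
doc_lengths = rng.uniform(0.5, 5.0, size=200)
term_rates = rng.uniform(0.2, 3.0, size=50)
counts = rng.poisson(lam=np.outer(doc_lengths, term_rates)).astype(float)

# SVD of the raw matrix: no centering, no document normalization.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
first_right = Vt[0]

# Direction of the mean document, scaled to unit length.
mean_dir = counts.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)

# |cosine| between the first right singular vector and the mean direction.
# abs() is needed because the sign of a singular vector is arbitrary.
cos_mean = abs(first_right @ mean_dir)

# Each document's coordinate along the first component vs. its L2 norm.
first_coord = np.abs(U[:, 0] * s[0])
doc_norms = np.linalg.norm(counts, axis=1)
corr = np.corrcoef(first_coord, doc_norms)[0, 1]

print("cosine(first right singular vector, mean document):", cos_mean)
print("correlation(first-component coordinate, document norm):", corr)
```

On synthetic data like this both numbers should come out close to 1, which is
consistent with the idea that the first component mostly soaks up the mean
document and the document norms. It does not prove equivalence with centering,
and real text may behave differently.
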
On Tue, Jul 6, 2010 at 8:27 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Related to normalization, the original LSA team claimed better results with
> tf.idf weighting. I would tend to use log(1+tf) . idf instead. I think
> that term weighting of this sort is quite common.
>
> Document level normalization is a bit less common. It is common practice,
> however, to not normalize documents but instead to drop the first
> eigenvector, on the theory that that is where the document norm winds up
> anyway.
> I would imagine that normalizing documents to some degree would make the
> numerics of computing the SVD a bit better and save the extra work of
> computing and then throwing away that eigenvector. The first eigenvector
> also takes the load of centering the documents.
>
> I do know that I have forgotten to toss that first eigenvector on several
> occasions and been mystified for a time at how my results weren't as good.
>
> On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > In my own experience, things like graphs (including bipartite graphs like
> > ratings matrices) I normalize before *and* after, but text I don't (unit)
> > normalize before, but do normalize after.
> >
> > The reasoning I use is that normalizing the rows of graphs has
> > a meaning in the context of the graph (you're doing the PageRank-like
> > thing of normalizing outflowing probability when looking at random
> > walks, for example, or for ratings matrices, you're saying that
> > everyone gets "one vote" to distribute amongst the things they've
> > rated [these apply for doing L_1 normalization, which isn't always
> > appropriate]), while I don't know if I buy the similar description of
> > what prenormalizing the rows of a text corpus means.
> >
>
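
The log(1+tf) . idf weighting and the L_1 row normalization mentioned in the
quoted messages can be sketched as below. This is a minimal illustration, not
anyone's actual code from the thread: the function names `log_tf_idf` and
`l1_normalize_rows` are hypothetical, and idf is assumed to be
log(n_docs / df), which is only one of several common definitions.

```python
import numpy as np

def log_tf_idf(counts):
    """Weight a documents-by-terms count matrix as log(1 + tf) * idf.

    Assumption: idf = log(n_docs / df). Other idf variants exist.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(counts, axis=0)
    # Guard against terms that appear in no document (df == 0).
    idf = np.log(n_docs / np.maximum(df, 1))
    return np.log1p(counts) * idf

def l1_normalize_rows(matrix):
    """Scale each row so its absolute values sum to 1 ("one vote" per row).

    All-zero rows are left as zeros.
    """
    matrix = np.asarray(matrix, dtype=float)
    sums = np.abs(matrix).sum(axis=1, keepdims=True)
    return np.divide(matrix, sums, out=np.zeros_like(matrix), where=sums > 0)

# Tiny made-up example: 3 documents x 3 terms.
counts = np.array([[2, 0, 1],
                   [0, 0, 3],
                   [1, 0, 1]])
weighted = log_tf_idf(counts)      # text-style term weighting
walk = l1_normalize_rows(counts)   # graph-style row normalization
print(weighted)
print(walk)
```

Note that with this idf definition a term that occurs in every document (the
third column here) gets weight zero, while the L_1 version turns each row into
a probability distribution over its columns.
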
