mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SVD and Clustering
Date Tue, 06 Jul 2010 15:35:54 GMT
The logic is that all documents are in the same language, so the most
common words in that language will provide a profile that dominates all
content-specific patterns.  The vector for each document should therefore be
roughly a scaled version of the content-neutral, language-specific vector
plus some corrections for the actual content variations.  This dominant
vector should be the first eigenvector.
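
To see why, here is a tiny toy sketch in numpy (made-up data, nothing
Mahout-specific): if every document vector is roughly a scaled copy of one
shared profile plus a small content perturbation, the top singular vector
lines up with that profile:

import numpy as np

rng = np.random.default_rng(0)
n_docs, n_terms = 200, 50
profile = rng.random(n_terms)                  # content-neutral language profile
lengths = rng.uniform(0.5, 2.0, size=n_docs)   # per-document scale (length)
content = 0.1 * rng.standard_normal((n_docs, n_terms))  # content variation

X = np.outer(lengths, profile) + content       # toy document-term matrix

_, _, vt = np.linalg.svd(X, full_matrices=False)
v1 = vt[0]                                     # first right singular vector (unit norm)
cos = abs(v1 @ profile) / np.linalg.norm(profile)
print(cos)   # close to 1: the top singular direction is the shared profile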

As I look at it again, this argument is not so strong.  Or perhaps, we
should consider it "subtle".

First, for linear term weighting (traditional tf.idf), all of the components
will scale with length and the argument is hooey.

For log(tf+1) . idf, on the other hand, doubling the length of the document
roughly doubles each tf, and for common words that turns into an offset
rather than a rescaling: they already have tf > 1, so
log(2 tf + 1) \approx log(tf + 1) + log 2.  For less common words, we will see a
transition from tf=0 to tf=1 as the length grows.  This will definitely
come much closer to isolating length and universal word frequency patterns
in the first eigenvector.
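
Here is the same point with made-up counts (just a toy check, not anything
from the Mahout code):

import numpy as np

tf_short = np.array([40.0, 5.0, 1.0, 0.0])   # common ... absent term counts
tf_long = 2.0 * tf_short                     # counts after roughly doubling length
tf_long[3] = 1.0                             # a rare word shows up in the longer doc

print(tf_long / np.where(tf_short > 0, tf_short, np.nan))
# linear tf: every nonzero component scales by 2, so length is a pure rescaling
print(np.log1p(tf_long) - np.log1p(tf_short))
# log(tf+1): roughly log 2 (~0.69) offset for the most common term, smaller
# offsets for rarer ones, and a 0 -> log 2 jump where the absent word appears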

A second problem could come about if your stop word list really is removing
all non-content words.  Obviously, for a short list, it won't be removing
all of the words common across all documents so you will still have the
remnant of the pattern I mentioned for log(tf+1), but it might conceivably
be decimated to the point of not being the first eigenvector.  As such, you
might view dropping the first eigenvector as roughly equivalent to using a
stop word list.
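
In code terms, "dropping the first eigenvector" just amounts to starting the
reduced representation at the second singular triple.  A plain numpy sketch
(not the Mahout Lanczos solver, and the function name is just for
illustration):

import numpy as np

def lsa_doc_vectors(X, k, drop_first=True):
    # X is the weighted document-term matrix; returns k-dimensional document
    # coordinates, skipping the first singular direction if asked.
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    start = 1 if drop_first else 0
    return u[:, start:start + k] * s[start:start + k]

With drop_first=False you get the usual truncated SVD; with it on, the
stop-word-like direction is gone before you ever cluster.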


On Tue, Jul 6, 2010 at 12:04 AM, Jake Mannix <jake.mannix@gmail.com> wrote:

> Hey Ted,
>
>  What is the reasoning again that dropping the first eigenvector is
> equivalent to forgetting about normalization / centering?  I have heard this
> before, but don't know if the math pops out at me... (isn't the first right
> singular vector also effectively the "TextRank" vector, for suitable input
> documents?)
>
>  -jake
>
> On Tue, Jul 6, 2010 at 8:27 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > Related to normalization, the original LSA team claimed better results
> > with tf.idf weighting.  I would tend to use log(1+tf) . idf instead.  I
> > think that term weighting of this sort is quite common.
> >
> > Document-level normalization is a bit less common.  It is common practice,
> > however, to not normalize documents but instead to drop the first
> > eigenvector, on the theory that that is where the document norm winds up
> > anyway.  I would imagine that normalizing documents to some degree would
> > make the numerics of computing the SVD a bit better and save the extra
> > work of computing and then throwing away that eigenvector.  The first
> > eigenvector also takes the load of centering the documents.
> >
> > I do know that I have forgotten to toss that first eigenvector on several
> > occasions and been mystified for a time at how my results weren't as good.
> >
> > On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix <jake.mannix@gmail.com>
> > wrote:
> >
> > > In my own experience, things like graphs (including bipartite graphs
> > > like ratings matrices) I normalize before *and* after, but text I don't
> > > (unit) normalize before, but do normalize after.
> > >
> > > The reasoning I use is that normalizing the rows of graphs has
> > > a meaning in the context of the graph (you're doing the PageRank-like
> > > thing of normalizing outflowing probability when looking at random
> > > walks, for example, or for ratings matrices, you're saying that
> > > everyone gets "one vote" to distribute amongst the things they've
> > > rated [these apply for doing L_1 normalization, which isn't always
> > > appropriate]), while I don't know if I buy the similar description of
> > > what pre-normalizing the rows of a text corpus would mean.
> > >
> >
>
