mahout-user mailing list archives

From Stefan Wienert <ste...@wienert.cc>
Subject tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 17:15:09 GMT
Hey Guys,

I am getting some strange results from my LSA pipeline.

First, let me explain the steps my data goes through:
1) Extract the term-document matrix (TDM) from a Lucene datastore, using TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (only for the no-reduction and SVD cases)
5) Calculate cosine similarity (from Mahout)
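
For readers who want to reproduce the steps above on a small scale, here is a minimal NumPy sketch. It is not the Mahout pipeline: the toy counts, the plain log IDF formula, and the function names `tfidf`, `reduce_rank`, and `cosine_to_first` are all assumptions made for illustration, and taking U_k * S_k as the reduced document vectors is one common choice that may differ from what the Mahout jobs output.

```python
import numpy as np

def tfidf(counts):
    """counts: (n_terms, n_docs) raw term counts -> TF-IDF weighted matrix."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)        # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))     # plain log IDF (assumed; Lucene's differs)
    return counts * idf[:, None]

def reduce_rank(doc_term, k):
    """Project documents (rows) onto the top-k singular directions: U_k * S_k."""
    u, s, _ = np.linalg.svd(doc_term, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine_to_first(doc_vectors):
    """Cosine similarity of document 0 to every document (including itself)."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    norms[norms == 0] = 1.0                      # avoid division by zero
    unit = doc_vectors / norms[:, None]
    return unit @ unit[0]

# Toy term-document matrix: 5 terms x 4 documents (made-up counts).
tdm = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [1, 1, 1, 1],
], dtype=float)

doc_term = tfidf(tdm).T                               # steps 1-2: weight, then transpose
sims_none = cosine_to_first(doc_term)                 # steps 3c + 5: no reduction
sims_k2 = cosine_to_first(reduce_rank(doc_term, 2))   # steps 3a + 5: rank 2
print(sims_none)
print(sims_k2)
```

With documents as rows, the same `cosine_to_first` call works on both the unreduced and the reduced vectors, so the two results can be compared directly.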

Now, some strange things happen.
First of all: the demo data shows the similarity of document 1 to
all other documents.

Results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png

Results using SVD, rank 10:
http://the-lord.de/img/svd-10.png
Some points fall to the bottom.

Results using SSVD, rank 10:
http://the-lord.de/img/ssvd-10.png

Results using SVD, rank 100:
http://the-lord.de/img/svd-100.png
More points fall to the bottom.

Results using SSVD, rank 100:
http://the-lord.de/img/ssvd-100.png

Results using SVD, rank 200:
http://the-lord.de/img/svd-200.png
Even more points fall to the bottom.

Results using SVD, rank 1000:
http://the-lord.de/img/svd-1000.png
Most points are at the bottom.

Please note the scale:
- average with no reduction: 0.8712
- average with SVD, rank 10: 0.2648
- average with SVD, rank 100: 0.0628
- average with SVD, rank 200: 0.0238
- average with SVD, rank 1000: 0.0116

So my question is:
Can you explain this behavior? Why do the documents become more
alike as the SVD rank increases? I expected the opposite.
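
One way to sanity-check this kind of behavior outside Mahout is to sweep the rank on a small random matrix and compare the average cosine similarity against the unreduced baseline. The sketch below is only a diagnostic under stated assumptions: the random 30 x 50 matrix, the helper `mean_cosine_to_first`, and the two candidate document representations (U_k * S_k versus U_k alone) are hypothetical choices, not a description of what the Mahout jobs produce.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((30, 50))          # hypothetical: 30 documents x 50 terms

def mean_cosine_to_first(vectors):
    """Average cosine similarity of row 0 to all other rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return float(np.mean(unit[1:] @ unit[0]))

u, s, _ = np.linalg.svd(docs, full_matrices=False)
baseline = mean_cosine_to_first(docs)   # no dimension reduction

results = {}
for k in (2, 10, 30):
    scaled = mean_cosine_to_first(u[:, :k] * s[:k])   # document vectors U_k * S_k
    unscaled = mean_cosine_to_first(u[:, :k])         # document vectors U_k only
    results[k] = (scaled, unscaled)
    print(k, round(scaled, 4), round(unscaled, 4))
print("baseline", round(baseline, 4))
```

At full rank (k = 30 here) the scaled vectors reproduce the baseline, because the rows of U * S have the same inner products as the rows of the original matrix. The unscaled rows of a square orthogonal U are orthonormal, so their average similarity goes to zero. Comparing a real pipeline's numbers against both columns may help narrow down which vectors are actually being fed into the similarity step.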

Cheers
Stefan
