mahout-user mailing list archives

From Stefan Wienert <ste...@wienert.cc>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 18:00:03 GMT
But why do I get different results for cosine similarity with no
dimension reduction (with 100,000 dimensions)?
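
The similarity-versus-distance question raised in the quoted reply below can be
checked on toy data. This is a minimal sketch in plain Python, assuming the
usual definitions (similarity = cos(angle), distance = 1 - cos(angle)). The
function names and the three vectors are illustrative only; they are not
Mahout's API and not the data from the plots.

```python
import math

def cosine_similarity(a, b):
    """cos(angle) between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cos(angle): 0 for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical sparse, non-negative weight vectors (TF-IDF weights are >= 0).
doc1 = [2.0, 0.0, 1.0, 0.0, 0.0, 0.0]
doc2 = [0.0, 3.0, 1.0, 0.0, 0.0, 0.0]  # shares one term with doc1
doc3 = [0.0, 0.0, 0.0, 1.0, 2.0, 0.0]  # shares no term with doc1

for name, other in (("doc2", doc2), ("doc3", doc3)):
    print(name,
          round(cosine_similarity(doc1, other), 4),
          round(cosine_distance(doc1, other), 4))
```

For non-negative vectors the similarity lies in [0, 1], so a document with
little term overlap scores near 0 as a similarity and near 1 as a distance.
Knowing which of the two a plot shows matters when comparing averages.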

2011/6/14 Fernando Fernández <fernando.fernandez.gonzalez@gmail.com>:
> Actually that's what your results are showing, aren't they? With rank 1000
> the similarity avg is the lowest...
>
>
> 2011/6/14 Jake Mannix <jake.mannix@gmail.com>
>
>> actually, wait - are your graphs showing *similarity*, or *distance*? In
>> higher dimensions, *distance* (1 - cos(angle)) and the angle itself should
>> grow, but on the other hand, *similarity* (cos(angle)) should go toward 0.
>>
>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <stefan@wienert.cc>
>> wrote:
>>
>> > Hey Guys,
>> >
>> > I have some strange results in my LSA-Pipeline.
>> >
>> > First, let me explain the steps my data goes through:
>> > 1) Extract the term-document matrix (TDM) from a Lucene datastore using
>> > TF-IDF weighting
>> > 2) Transposing TDM
>> > 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>> > 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>> > 3c) Using no dimension reduction (for testing purposes)
>> > 4) Transpose result (ONLY none / svd)
>> > 5) Calculating Cosine Similarity (from Mahout)
>> >
>> > Now... Some strange things happen:
>> > First of all: The demo data shows the similarity from document 1 to
>> > all other documents.
>> >
>> > the results using only cosine similarity (without dimension reduction):
>> > http://the-lord.de/img/none.png
>> >
>> > the result using svd, rank 10
>> > http://the-lord.de/img/svd-10.png
>> > some points falling down to the bottom.
>> >
>> > the results using ssvd rank 10
>> > http://the-lord.de/img/ssvd-10.png
>> >
>> > the result using svd, rank 100
>> > http://the-lord.de/img/svd-100.png
>> > more points falling down to the bottom.
>> >
>> > the results using ssvd rank 100
>> > http://the-lord.de/img/ssvd-100.png
>> >
>> > the results using svd rank 200
>> > http://the-lord.de/img/svd-200.png
>> > even more points falling down to the bottom.
>> >
>> > the results using svd rank 1000
>> > http://the-lord.de/img/svd-1000.png
>> > most points are at the bottom
>> >
>> > please note the scale:
>> > - the avg from none: 0.8712
>> > - the avg from svd rank 10: 0.2648
>> > - the avg from svd rank 100: 0.0628
>> > - the avg from svd rank 200: 0.0238
>> > - the avg from svd rank 1000: 0.0116
>> >
>> > so my question is:
>> > Can you explain this behavior? Why are the documents getting more
>> > equal with more ranks in svd? I thought it was the opposite.
>> >
>> > Cheers
>> > Stefan
>> >
>>
>
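
Steps 1, 2, 3c and 5 of the pipeline quoted above can be sketched on a toy
corpus in plain Python. Everything here is a stand-in: the three documents, the
whitespace tokenizer, the idf formula (log(N/df) + 1, one common variant) and
the function names are assumptions for illustration. They are not what Lucene
or Mahout compute internally, and the SVD steps (3a/3b) are left out.

```python
import math
from collections import Counter

# Hypothetical three-document corpus standing in for the Lucene datastore.
docs = [
    "mahout svd lanczos",
    "mahout ssvd stochastic svd",
    "lucene index search",
]

def tfidf_term_document_matrix(texts):
    """Step 1: rows are terms, columns are documents, weights are tf * idf."""
    tokenized = [text.split() for text in texts]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(texts)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for term in vocab:
        idf = math.log(n_docs / doc_freq[term]) + 1.0  # assumed idf variant
        matrix.append([tokens.count(term) * idf for tokens in tokenized])
    return vocab, matrix

def transpose(matrix):
    """Step 2: turn the term-document matrix into a document-term matrix."""
    return [list(column) for column in zip(*matrix)]

def cosine_similarity(a, b):
    """cos(angle) between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

vocab, tdm = tfidf_term_document_matrix(docs)
doc_vectors = transpose(tdm)  # one row per document

# Step 5 without dimension reduction (3c): document 1 against every document.
similarities = [cosine_similarity(doc_vectors[0], row) for row in doc_vectors]
print([round(s, 4) for s in similarities])
```

The orientation matters: cosine similarity over the rows of the transposed
matrix compares documents, while the same call on the untransposed matrix
would compare terms.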



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
