mahout-user mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: tf-idf + svd + cosine similarity
Date: Tue, 14 Jun 2011 22:37:42 GMT
That actually looks more like it. Not so many documents are similar to a
randomly picked one.
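
(A quick way to see why that is healthier: the cosine similarity of random
vectors concentrates around 0 as the dimension grows, so in an honest
high-dimensional space most documents should *not* look similar to a
randomly picked one. A throwaway sketch -- plain Java on synthetic
vectors, not Mahout code:)

  import java.util.Random;

  public class RandomCosine {
    public static void main(String[] args) {
      Random rnd = new Random(42);
      for (int dim : new int[] {10, 100, 1000, 10000}) {
        int pairs = 100;
        double sum = 0;
        for (int i = 0; i < pairs; i++) {
          sum += Math.abs(cosine(randomVector(rnd, dim), randomVector(rnd, dim)));
        }
        System.out.println("dim " + dim + ": avg |cos| = " + sum / pairs);
      }
    }

    // i.i.d. gaussian coordinates give a direction uniform on the sphere
    static double[] randomVector(Random rnd, int dim) {
      double[] v = new double[dim];
      for (int i = 0; i < dim; i++) {
        v[i] = rnd.nextGaussian();
      }
      return v;
    }

    // cosine similarity: dot product divided by the product of the norms
    static double cosine(double[] a, double[] b) {
      double dot = 0, na = 0, nb = 0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / Math.sqrt(na * nb);
    }
  }

(The average |cos| falls off roughly like 1/sqrt(dim), which is why a 0.87
average over ~100k raw dimensions looked suspicious.)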


On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <stefan@wienert.cc> wrote:
> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see
> http://the-lord.de/img/beispielwerte.pdf
> for better results.
>
> First... U and V hold the singular vectors, not the eigenvectors ;)
>
> Lanczos SVD in Mahout computes the eigenvectors of M*M (it multiplies
> the input matrix with its transpose).
>
> In fact, I don't need U, just V, so I need to transpose M (because the
> eigenvectors of M*M are V, so running Lanczos on the transpose gives me
> the document-space vectors).
>
> So... about normalizing the eigenvectors: isn't cosine similarity doing
> that anyway, i.e. ignoring the length of the vectors?
> http://en.wikipedia.org/wiki/Cosine_similarity
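> (Worked toy example: with a = (1, 2, 3) and b = 2*a = (2, 4, 6),
> cos(a, b) = (1*2 + 2*4 + 3*6) / (sqrt(14) * sqrt(56)) = 28 / 28 = 1.0,
> so the lengths really do drop out and only the direction matters.)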
>
> my parameters for ssvd:
> --rank 100
> --oversampling 10
> --blockHeight 227
> --computeU false
> --input
> --output
>
> the rest should be at the defaults.
>
> Actually, I do not really know what this oversampling parameter means...
>
> 2011/6/14 Dmitriy Lyubimov <dlieu.7@gmail.com>:
>> Interesting.
>>
>> (One confusion on my side RE: Lanczos -- is it computing the U
>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or
>> right. If it's V (the right eigenvectors), this sequence should be fine.)
>>
>> With SSVD I don't do a transpose, I just do the computation of U, which
>> will produce the document singular vectors directly.
>>
>> Also, I am not sure that Lanczos actually normalizes the eigenvectors,
>> but SSVD does (or it multiplies the normalized version by the square
>> root of the singular value, whichever is requested). So depending on
>> which space your rotated results live in, the cosine similarities may
>> differ. I assume you used the normalized (true) eigenvectors from SSVD.
>>
>> Also, it would be interesting to know what oversampling parameter (p) you used.
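>>
>> (For context: p is the oversampling -- SSVD projects the input onto
>> k+p random directions, where k is the requested rank; the extra p
>> dimensions are a cheap safety margin that makes the recovered top-k
>> subspace considerably more accurate. A small p on the order of 10 is
>> usually enough.)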
>>
>> Thanks.
>> -d
>>
>>
>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>> So... let's check the dimensions:
>>>
>>> First step: Lucene Output:
>>> 227 rows (=docs) and 107909 cols (=terms)
>>>
>>> transposed to:
>>> 107909 rows and 227 cols
>>>
>>> reduced with svd (rank 100) to:
>>> 99 rows and 227 cols
>>>
>>> transposed to (actually there was a bug here, with no effect on the
>>> SVD result but with an effect on the NONE result):
>>> 227 rows and 99 cols
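>>> (The 99 instead of 100: Mahout's Lanczos seems to hand back one
>>> vector less than the requested rank.)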
>>>
>>> So... now the cosine results are very similar to SVD 200.
>>>
>>> Results are added.
>>>
>>> @Sebastian: I will check if the bug affects my results.
>>>
>>> 2011/6/14 Fernando Fernández <fernando.fernandez.gonzalez@gmail.com>:
>>>> Hi Stefan,
>>>>
>>>> Are you sure you need to transpose the input matrix? I thought that
>>>> what you get from the Lucene index was already a document(rows)-
>>>> term(columns) matrix, but you say that you obtain a term-document
>>>> matrix and transpose it. Is this correct? What are you using to obtain
>>>> this matrix from Lucene? Is it possible that you are calculating
>>>> similarities with the wrong matrix in one of the two cases?
>>>> (With/without dimension reduction.)
>>>>
>>>> Best,
>>>> Fernando.
>>>>
>>>> 2011/6/14 Sebastian Schelter <ssc@apache.org>
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> I checked the implementation of RowSimilarityJob and we might still
>>>>> have a bug in the 0.5 release... (f**k). I don't know if your problem
>>>>> is caused by that, but the similarity scores might not be correct...
>>>>>
>>>>> We had this issue in 0.4 already, when someone realized that
>>>>> cooccurrences were mapped out inconsistently, so for 0.5 we made sure
>>>>> that we always map the smaller row as the first value. But apparently
>>>>> I did not adjust the value setting for the Cooccurrence object...
>>>>>
>>>>> In 0.5 the code is:
>>>>>
>>>>>  if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>  } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>  }
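>>>>>  // bug: the values are emitted in (valueA, valueB) order even when
>>>>>  // the rows above were swapped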
>>>>>  coocurrence.set(column.get(), valueA, valueB);
>>>>>
>>>>> But it should be (already fixed in the current trunk a few days ago):
>>>>>
>>>>>  if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>>   coocurrence.set(column.get(), valueA, valueB);
>>>>>  } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>>   coocurrence.set(column.get(), valueB, valueA);
>>>>>  }
>>>>>
>>>>> Maybe you could rerun your test with the current trunk?
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>>>
>>>>>> It is a similarity, not a distance. Higher values mean more
>>>>>> similarity, not less.
>>>>>>
>>>>>> I agree that similarity ought to decrease with more dimensions. That
>>>>>> is what you observe -- except that you see quite high average
>>>>>> similarity with no dimension reduction!
>>>>>>
>>>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>>>> but a few dimensions. What's the dimensionality of the input without
>>>>>> dimension reduction?
>>>>>>
>>>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>>
>>>>>>> Actually I'm using RowSimilarityJob() with
>>>>>>> --input input
>>>>>>> --output output
>>>>>>> --numberOfColumns documentCount
>>>>>>> --maxSimilaritiesPerRow documentCount
>>>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>>>
>>>>>>> Actually, I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>>>> calculates...
>>>>>>> The source says: "distributed implementation of cosine similarity
>>>>>>> that does not center its data".
>>>>>>>
>>>>>>> So... this seems to be the similarity and not the distance?
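>>>>>>>
>>>>>>> (If I read it right, "does not center its data" means the vectors
>>>>>>> are used as-is; centering would subtract each vector's mean from
>>>>>>> its components first, which would turn the cosine into the Pearson
>>>>>>> correlation. Either way it is a similarity in [-1, 1], not a
>>>>>>> distance.)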
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2011/6/14 Stefan Wienert<stefan@wienert.cc>:
>>>>>>>
>>>>>>>> But... why do I get these different results with cosine similarity
>>>>>>>> with no dimension reduction (with 100,000 dimensions)?
>>>>>>>>
>>>>>>>> 2011/6/14 Fernando Fernández<fernando.fernandez.gonzalez@gmail.com>:
>>>>>>>>
>>>>>>>>> Actually that's what your results are showing, aren't they? With
>>>>>>>>> rank 1000 the similarity avg is the lowest...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2011/6/14 Jake Mannix<jake.mannix@gmail.com>
>>>>>>>>>
>>>>>>>>>> Actually, wait -- are your graphs showing *similarity* or
>>>>>>>>>> *distance*? In higher dimensions, the *distance* (and the angle)
>>>>>>>>>> should grow, but on the other hand the *similarity* (cos(angle))
>>>>>>>>>> should go toward 0.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Guys,
>>>>>>>>>>>
>>>>>>>>>>> I have some strange results in my LSA-Pipeline.
>>>>>>>>>>>
>>>>>>>>>>> First, let me explain the steps my data goes through:
>>>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore,
>>>>>>>>>>> using TF-IDF as the weighting
>>>>>>>>>>> 2) Transpose the TDM
>>>>>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>>>> 4) Transpose the result (ONLY for none / svd)
>>>>>>>>>>> 5) Calculate the cosine similarity (from Mahout)
>>>>>>>>>>>
>>>>>>>>>>> Now... some strange things happen.
>>>>>>>>>>> First of all: the demo data shows the similarity from document 1
>>>>>>>>>>> to all other documents.
>>>>>>>>>>>
>>>>>>>>>>> The results using only cosine similarity (without dimension
>>>>>>>>>>> reduction):
>>>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>>>
>>>>>>>>>>> the result using svd, rank 10
>>>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>>>> some points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using ssvd rank 10
>>>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>>>
>>>>>>>>>>> the result using svd, rank 100
>>>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>>>> more points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using ssvd rank 100
>>>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>>>
>>>>>>>>>>> the results using svd rank 200
>>>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>>>> even more points falling down to the bottom.
>>>>>>>>>>>
>>>>>>>>>>> the results using svd rank 1000
>>>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>>>> most points are at the bottom
>>>>>>>>>>>
>>>>>>>>>>> Please note the scale:
>>>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>>>> - the avg for svd rank 10: 0.2648
>>>>>>>>>>> - the avg for svd rank 100: 0.0628
>>>>>>>>>>> - the avg for svd rank 200: 0.0238
>>>>>>>>>>> - the avg for svd rank 1000: 0.0116
>>>>>>>>>>>
>>>>>>>>>>> So my question is:
>>>>>>>>>>> Can you explain this behavior? Why are the documents getting
>>>>>>>>>>> more similar with more ranks in SVD? I thought it would be the
>>>>>>>>>>> opposite.
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stefan Wienert
>>>>>>>>
>>>>>>>> http://www.wienert.cc
>>>>>>>> stefan@wienert.cc
>>>>>>>>
>>>>>>>> Telefon: +495251-2026838
>>>>>>>> Mobil: +49176-40170270
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Wienert
>>>>>>>
>>>>>>> http://www.wienert.cc
>>>>>>> stefan@wienert.cc
>>>>>>>
>>>>>>> Telefon: +495251-2026838
>>>>>>> Mobil: +49176-40170270
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>
