mahout-user mailing list archives

From Stefan Wienert <ste...@wienert.cc>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 21:28:34 GMT
Hi Sebastian,

the bug does not affect me with:
NONE > bugcheck.pdf
SVD > bugcheck2.pdf
(although it was active)

Cheers,
Stefan


2011/6/14 Sebastian Schelter <ssc@apache.org>:
> Hi Stefan,
>
> I checked the implementation of RowSimilarityJob and we might still have a
> bug in the 0.5 release... (f**k). I don't know if your problem is caused by
> that, but the similarity scores might not be correct...
>
> We had this issue in 0.4 already, when someone realized that cooccurrences
> were mapped out inconsistently, so for 0.5 we made sure that we always map
> the smaller row as first value. But apparently I did not adjust the value
> setting for the Cooccurrence object...
>
> In 0.5 the code is:
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>  }
>  coocurrence.set(column.get(), valueA, valueB);
>
> But it should be (already fixed in current trunk some days ago):
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>   coocurrence.set(column.get(), valueA, valueB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>   coocurrence.set(column.get(), valueB, valueA);
>  }
>
> Maybe you could rerun your test with the current trunk?
>
> --sebastian
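
The ordering fix described above can be sketched in isolation. This is a minimal illustration, not Mahout's code: `CanonicalPair`, `of`, and `ofWithoutValueSwap` are hypothetical names, and the class only shows why the values have to be swapped together with the row ids.

```java
// Hypothetical stand-in for the row pair / cooccurrence bookkeeping discussed above.
// None of these names come from Mahout.
final class CanonicalPair {
    final int firstRow;
    final int secondRow;
    final double firstValue;
    final double secondValue;

    private CanonicalPair(int firstRow, int secondRow, double firstValue, double secondValue) {
        this.firstRow = firstRow;
        this.secondRow = secondRow;
        this.firstValue = firstValue;
        this.secondValue = secondValue;
    }

    /** Puts the smaller row id first and swaps the values along with the ids. */
    static CanonicalPair of(int rowA, int rowB, double valueA, double valueB) {
        if (rowA <= rowB) {
            return new CanonicalPair(rowA, rowB, valueA, valueB);
        }
        return new CanonicalPair(rowB, rowA, valueB, valueA);
    }

    /** Reorders the ids but leaves the values in their original order (the inconsistent variant). */
    static CanonicalPair ofWithoutValueSwap(int rowA, int rowB, double valueA, double valueB) {
        if (rowA <= rowB) {
            return new CanonicalPair(rowA, rowB, valueA, valueB);
        }
        return new CanonicalPair(rowB, rowA, valueA, valueB);
    }

    public static void main(String[] args) {
        CanonicalPair consistent = of(7, 3, 0.5, 2.0);
        CanonicalPair inconsistent = ofWithoutValueSwap(7, 3, 0.5, 2.0);
        System.out.println("consistent:   row " + consistent.firstRow + " has value " + consistent.firstValue);
        System.out.println("inconsistent: row " + inconsistent.firstRow + " has value " + inconsistent.firstValue);
    }
}
```

A measure that only multiplies the two values together would not notice the difference, since the product is the same in either order; a measure that treats the two values asymmetrically would.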
>
> On 14.06.2011 20:54, Sean Owen wrote:
>>
>> It is a similarity, not a distance. Higher values mean more
>> similarity, not less.
>>
>> I agree that similarity ought to decrease with more dimensions. That
>> is what you observe -- except that you see quite high average
>> similarity with no dimension reduction!
>>
>> An average cosine similarity of 0.87 sounds "high" to me for anything
>> but a few dimensions. What's the dimensionality of the input without
>> dimension reduction?
>>
>> Something is amiss in this pipeline. It is an interesting question!
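
Why 0.87 looks high for very high-dimensional input can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not the pipeline from this thread: it draws random sparse non-negative vectors (class and method names are hypothetical) and averages their cosine similarity. Real tf-idf vectors share common terms and overlap far more than random ones, so this only shows the general tendency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Toy experiment: average cosine similarity of random sparse vectors.
final class SparseCosineDemo {

    /** Builds a vector with exactly nonZeros positive entries; requires nonZeros <= dimensions. */
    static Map<Integer, Double> randomSparseVector(Random rnd, int dimensions, int nonZeros) {
        Map<Integer, Double> vector = new HashMap<>();
        while (vector.size() < nonZeros) {
            // Weights lie in [0.1, 1.1), so the norm is never zero.
            vector.put(rnd.nextInt(dimensions), rnd.nextDouble() + 0.1);
        }
        return vector;
    }

    static double norm(Map<Integer, Double> vector) {
        double sumOfSquares = 0.0;
        for (double value : vector.values()) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Uncentered cosine: dot product over the product of the norms. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    static double averageSimilarity(long seed, int dimensions, int nonZeros, int pairs) {
        Random rnd = new Random(seed);
        double sum = 0.0;
        for (int i = 0; i < pairs; i++) {
            sum += cosine(randomSparseVector(rnd, dimensions, nonZeros),
                          randomSparseVector(rnd, dimensions, nonZeros));
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        // 100,000 dimensions as mentioned in the thread; 50 non-zeros per vector is an arbitrary choice.
        System.out.println(averageSimilarity(42L, 100_000, 50, 200));
    }
}
```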
>>
>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<stefan@wienert.cc>  wrote:
>>>
>>> Actually I'm using  RowSimilarityJob() with
>>> --input input
>>> --output output
>>> --numberOfColumns documentCount
>>> --maxSimilaritiesPerRow documentCount
>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>
>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>> calculates...
>>> the source says: "distributed implementation of cosine similarity that
>>> does not center its data"
>>>
>>> So... this seems to be the similarity and not the distance?
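
For reference, "cosine similarity that does not center its data" can be written out as below. This is a plain, single-machine sketch of the textbook formula, not the distributed Mahout implementation; the class and method names are hypothetical, and returning 0 for a zero vector is a convention chosen here.

```java
// Textbook uncentered cosine similarity on dense vectors.
final class UncenteredCosine {

    /** cos(angle) between a and b, computed on the raw (uncentered) values. */
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double squaredNormA = 0.0;
        double squaredNormB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            squaredNormA += a[i] * a[i];
            squaredNormB += b[i] * b[i];
        }
        if (squaredNormA == 0.0 || squaredNormB == 0.0) {
            return 0.0; // convention for this sketch: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }

    /** The corresponding distance: higher means less alike. */
    static double distance(double[] a, double[] b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        double[] doc1 = {1.0, 2.0, 3.0};
        double[] doc2 = {2.0, 4.0, 6.0};
        System.out.println("similarity: " + similarity(doc1, doc2));
        System.out.println("distance:   " + distance(doc1, doc2));
    }
}
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so higher values mean more similar.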
>>>
>>> Cheers,
>>> Stefan
>>>
>>>
>>>
>>> 2011/6/14 Stefan Wienert<stefan@wienert.cc>:
>>>>
>>>> but... why do I get the different results with cosine similarity with
>>>> no dimension reduction (with 100,000 dimensions) ?
>>>>
>>>> 2011/6/14 Fernando Fernández<fernando.fernandez.gonzalez@gmail.com>:
>>>>>
>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>> 1000 the similarity avg is the lowest...
>>>>>
>>>>>
>>>>> 2011/6/14 Jake Mannix<jake.mannix@gmail.com>
>>>>>
>>>>>> actually, wait - are your graphs showing *similarity*, or *distance*?
>>>>>> In higher dimensions, *distance* (and cosine angle) should grow, but on
>>>>>> the other hand, *similarity* (cos(angle)) should go toward 0.
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<stefan@wienert.cc>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Guys,
>>>>>>>
>>>>>>> I have some strange results in my LSA-Pipeline.
>>>>>>>
>>>>>>> First, I explain the steps my data goes through:
>>>>>>> 1) Extract term-document matrix (TDM) from a Lucene datastore using
>>>>>>> TFIDF as weighter
>>>>>>> 2) Transposing TDM
>>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>>>>>>> 3c) Using no dimension reduction (for testing purpose)
>>>>>>> 4) Transpose result (ONLY none / svd)
>>>>>>> 5) Calculating Cosine Similarity (from Mahout)
>>>>>>>
>>>>>>> Now... Some strange things happen:
>>>>>>> First of all: The demo data shows the similarity from document 1 to
>>>>>>> all other documents.
>>>>>>>
>>>>>>> the results using only cosine similarity (without dimension
>>>>>>> reduction):
>>>>>>> http://the-lord.de/img/none.png
>>>>>>>
>>>>>>> the result using svd, rank 10
>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>> some points falling down to the bottom.
>>>>>>>
>>>>>>> the results using ssvd rank 10
>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>
>>>>>>> the result using svd, rank 100
>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>> more points falling down to the bottom.
>>>>>>>
>>>>>>> the results using ssvd rank 100
>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>
>>>>>>> the results using svd rank 200
>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>> even more points falling down to the bottom.
>>>>>>>
>>>>>>> the results using svd rank 1000
>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>> most points are at the bottom
>>>>>>>
>>>>>>> please be aware of the scale:
>>>>>>> - the avg from none: 0.8712
>>>>>>> - the avg from svd rank 10: 0.2648
>>>>>>> - the avg from svd rank 100: 0.0628
>>>>>>> - the avg from svd rank 200: 0.0238
>>>>>>> - the avg from svd rank 1000: 0.0116
>>>>>>>
>>>>>>> so my question is:
>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>> equal with more ranks in svd? I thought it was the opposite.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Stefan
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
