mahout-user mailing list archives

From Fernando Fernández <fernando.fernandez.gonza...@gmail.com>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 19:23:35 GMT
Hi Stefan,

Are you sure you need to transpose the input matrix? I thought that what you
get from the Lucene index was already a document(rows)-term(columns) matrix,
but you say that you obtain a term-document matrix and transpose it. Is this
correct? What are you using to obtain this matrix from Lucene? Is it
possible that you are calculating similarities with the wrong matrix in one
of the two cases (with/without dimension reduction)?
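To make the orientation question concrete, here is a toy sketch in plain Python (made-up counts, not real TF-IDF weights): running a row-similarity computation on a term-document matrix compares terms, while running it on the transposed (document-term) matrix compares documents.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-term x 2-document matrix (term rows, document columns),
# the orientation a term-document extraction would produce.
td = [
    [1.0, 0.0],  # term appears only in doc 0
    [2.0, 2.0],  # term appears in both docs
    [0.0, 3.0],  # term appears only in doc 1
]

# Row similarity on td compares TERMS, not documents:
term_sim = cosine(td[0], td[2])  # disjoint terms -> 0.0

# Transposing first gives document rows, so row similarity compares documents:
dt = [list(row) for row in zip(*td)]
doc_sim = cosine(dt[0], dt[1])   # the shared term makes this nonzero
```

So if the similarities were accidentally computed on the wrong orientation in one of the two pipelines, the numbers would not be comparable at all.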

Best,
Fernando.

2011/6/14 Sebastian Schelter <ssc@apache.org>

> Hi Stefan,
>
> I checked the implementation of RowSimilarityJob and we might still have a
> bug in the 0.5 release... (f**k). I don't know if your problem is caused by
> that, but the similarity scores might not be correct...
>
> We had this issue in 0.4 already, when someone realized that cooccurrences
> were mapped out inconsistently, so for 0.5 we made sure that we always map
> the smaller row as first value. But apparently I did not adjust the value
> setting for the Cooccurrence object...
>
> In 0.5 the code is:
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>  }
>  coocurrence.set(column.get(), valueA, valueB);
>
> But it should be (already fixed in the current trunk a few days ago):
>
>  if (rowA <= rowB) {
>   rowPair.set(rowA, rowB, weightA, weightB);
>   coocurrence.set(column.get(), valueA, valueB);
>  } else {
>   rowPair.set(rowB, rowA, weightB, weightA);
>   coocurrence.set(column.get(), valueB, valueA);
>  }
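To see the inconsistency, here are the two code paths paraphrased in Python (a sketch, not the actual Mahout classes): in the 0.5 version, the same cooccurrence observed in the two possible orders emits its values in different orders, so a reducer cannot rely on which value belongs to which row. (For cosine the product of the two values is symmetric, so it may not be the cause here, but measures that treat the two values asymmetrically would get wrong scores.)

```python
def emit_buggy(row_a, row_b, value_a, value_b):
    # 0.5 behavior: the rows get reordered, but the values do not.
    pair = (row_a, row_b) if row_a <= row_b else (row_b, row_a)
    return pair, (value_a, value_b)

def emit_fixed(row_a, row_b, value_a, value_b):
    # trunk behavior: the values are swapped together with the rows.
    if row_a <= row_b:
        return (row_a, row_b), (value_a, value_b)
    return (row_b, row_a), (value_b, value_a)

# The same cooccurrence seen in both orders should emit identical records.
fixed_ab = emit_fixed(3, 7, 0.5, 2.0)
fixed_ba = emit_fixed(7, 3, 2.0, 0.5)   # consistent with fixed_ab

buggy_ab = emit_buggy(3, 7, 0.5, 2.0)
buggy_ba = emit_buggy(7, 3, 2.0, 0.5)   # values land in the wrong order
```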
>
> Maybe you could rerun your test with the current trunk?
>
> --sebastian
>
>
> On 14.06.2011 20:54, Sean Owen wrote:
>
>> It is a similarity, not a distance. Higher values mean more
>> similarity, not less.
>>
>> I agree that similarity ought to decrease with more dimensions. That
>> is what you observe -- except that you see quite high average
>> similarity with no dimension reduction!
>>
>> An average cosine similarity of 0.87 sounds "high" to me for anything
>> but a few dimensions. What's the dimensionality of the input without
>> dimension reduction?
>>
>> Something is amiss in this pipeline. It is an interesting question!
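For a rough baseline of what "high" means: independent random vectors become nearly orthogonal as dimensionality grows, with expected |cosine| on the order of 1/sqrt(d). This sketch uses Gaussian vectors, not TF-IDF vectors (which are sparse and non-negative, so their cosines sit higher than this baseline), so it is only indicative, but an average of 0.87 in ~100,000 dimensions still looks suspicious against it.

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def avg_abs_cosine(dim, pairs=200):
    """Average |cosine| over random Gaussian vector pairs of a given dimension."""
    total = 0.0
    for _ in range(pairs):
        u = [random.gauss(0.0, 1.0) for _ in range(dim)]
        v = [random.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs

low_dim = avg_abs_cosine(10)      # roughly sqrt(2 / (pi * 10)), ~0.25
high_dim = avg_abs_cosine(1000)   # roughly sqrt(2 / (pi * 1000)), ~0.025
```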
>>
>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<stefan@wienert.cc>
>>  wrote:
>>
>>> Actually I'm using RowSimilarityJob() with
>>> --input input
>>> --output output
>>> --numberOfColumns documentCount
>>> --maxSimilaritiesPerRow documentCount
>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>
>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>> calculates...
>>> the source says: "distributed implementation of cosine similarity that
>>> does not center its data"
>>>
>>> So... this seems to be the similarity and not the distance?
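If I read the source comment right, "uncentered" just means plain cosine on the raw vectors, as opposed to subtracting each vector's mean first (centering, which would make it Pearson-correlation-like). A toy contrast with hypothetical numbers, to show the two need not agree:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def center(u):
    """Subtract the vector's own mean from each component."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]

u = [1.0, 2.0, 3.0]
v = [3.0, 3.0, 4.0]

uncentered = cosine(u, v)                # plain (uncentered) cosine
centered = cosine(center(u), center(v))  # Pearson-style centered cosine
```

Either way it is a similarity, not a distance: 1.0 means identical direction, 0 means orthogonal.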
>>>
>>> Cheers,
>>> Stefan
>>>
>>>
>>>
>>> 2011/6/14 Stefan Wienert<stefan@wienert.cc>:
>>>
>>>> but... why do I get different results with cosine similarity with
>>>> no dimension reduction (with 100,000 dimensions)?
>>>>
>>>> 2011/6/14 Fernando Fernández<fernando.fernandez.gonzalez@gmail.com>:
>>>>
>>>>> Actually that's what your results are showing, aren't they? With rank
>>>>> 1000 the similarity avg is the lowest...
>>>>>
>>>>>
>>>>> 2011/6/14 Jake Mannix<jake.mannix@gmail.com>
>>>>>
>>>>>> actually, wait - are your graphs showing *similarity*, or *distance*? In
>>>>>> higher dimensions, *distance* (and cosine angle) should grow, but on the
>>>>>> other hand, *similarity* (1-cos(angle)) should go toward 0.
>>>>>>
>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<stefan@wienert.cc>
>>>>>> wrote:
>>>>>>
>>>>>>  Hey Guys,
>>>>>>>
>>>>>>> I have some strange results in my LSA-Pipeline.
>>>>>>>
>>>>>>> First, I explain the steps my data is making:
>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>> TF-IDF as the weighting
>>>>>>> 2) Transposing the TDM
>>>>>>> 3a) Using Mahout SVD (Lanczos) on the transposed TDM
>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>> 3c) Using no dimension reduction (for testing purposes)
>>>>>>> 4) Transposing the result (ONLY for none / SVD)
>>>>>>> 5) Calculating cosine similarity (from Mahout)
>>>>>>>
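The steps above, shrunk to a toy pure-Python sketch (made-up weights instead of real TF-IDF output, and power iteration standing in for Lanczos/SSVD, at rank 1 only):

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, x):
    return [sum(r[j] * x[j] for j in range(len(x))) for r in m]

def norm(x):
    return math.sqrt(sum(a * a for a in x))

# Step 1 stand-in: a tiny document-term matrix (rows = documents).
A = [
    [1.0, 2.0, 0.0, 0.0],  # doc 0, topic X
    [2.0, 4.0, 0.0, 0.0],  # doc 1, topic X
    [0.0, 0.0, 1.0, 3.0],  # doc 2, topic Y
]
At = transpose(A)  # steps 2/4 are just such orientation changes

# Step 3 stand-in: power iteration on A^T A converges to the top
# right-singular vector (a crude substitute for Lanczos / stochastic SVD).
v = [1.0] * len(At)
for _ in range(100):
    v = matvec(At, matvec(A, v))
    n = norm(v)
    v = [a / n for a in v]

# Step 5 input: each document's rank-1 coordinate is its projection onto v.
# Same-topic docs keep large coordinates; the off-topic doc collapses to ~0.
coords = [sum(a * b for a, b in zip(row, v)) for row in A]
```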
>>>>>>> Now... some strange things happen:
>>>>>>> First of all: the demo data shows the similarity from document 1 to
>>>>>>> all other documents.
>>>>>>>
>>>>>>> the results using only cosine similarity (without dimension
>>>>>>> reduction):
>>>>>>> http://the-lord.de/img/none.png
>>>>>>>
>>>>>>> the result using svd, rank 10
>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>> some points falling down to the bottom.
>>>>>>>
>>>>>>> the results using ssvd rank 10
>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>
>>>>>>> the result using svd, rank 100
>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>> more points falling down to the bottom.
>>>>>>>
>>>>>>> the results using ssvd rank 100
>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>
>>>>>>> the results using svd rank 200
>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>> even more points falling down to the bottom.
>>>>>>>
>>>>>>> the results using svd rank 1000
>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>> most points are at the bottom
>>>>>>>
>>>>>>> please note the scale:
>>>>>>> - the avg for none: 0.8712
>>>>>>> - the avg for svd rank 10: 0.2648
>>>>>>> - the avg for svd rank 100: 0.0628
>>>>>>> - the avg for svd rank 200: 0.0238
>>>>>>> - the avg for svd rank 1000: 0.0116
>>>>>>>
>>>>>>> so my question is:
>>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>>> equal with more ranks in svd? I thought it was the opposite.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>>
>
