lucene-solr-user mailing list archives

From "Markus Jelsma - Buyways B.V." <mar...@buyways.nl>
Subject Re: tf*idf scoring
Date Tue, 03 Nov 2009 13:54:03 GMT


> >
> >
> > According to different algorithms, the tf for term c would be 3 / 1 =
> > 0.33 instead of 1 returned by Solr.
> 
> I don't follow.  The TF (term frequency) is the number of times the  
> term c occurs in that particular document, i.e. 1 time.


I see that I made some typos above and below. I wrote 3 / 1 = 0.33
instead of 1 / 3 = 0.33. Term c has 1 occurrence, which the other
algorithms normalize by dividing by the number of terms in the document.
So instead of tf = #occurrences (1), the other algorithms compute
tf = #occurrences / #terms (0.33).
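
To make the difference concrete, here is a small Python sketch of the two
tf definitions. The three-term document and the function names are made up
for illustration; the only thing taken from this thread is that term c
occurs once among three terms.

```python
from collections import Counter


def raw_tf(term, tokens):
    """Raw term frequency: number of times the term occurs in the document."""
    return Counter(tokens)[term]


def normalized_tf(term, tokens):
    """Raw term frequency divided by the number of terms in the document."""
    return raw_tf(term, tokens) / len(tokens)


# Hypothetical document: three terms, one of which is c.
doc = ["a", "b", "c"]

print(raw_tf("c", doc))         # 1, the value Solr reports
print(normalized_tf("c", doc))  # 1 / 3, the value the other algorithms use
```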


> 
> > Also, the tf*idf value i get is 0.5
> > for term c and i get 0.333 for term a. It looks like tf*idf is  
> > quotient
> > of document frequency and term frequency.
> 
> Yes, indeed.  IDF == Inverse Document Frequency, in other words, 1/DF.


Indeed, but most algorithms I have seen on this topic calculate idf as
ln(#docs / df). Lucene does something similar, as I read at
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.html

idf(t)  =   1 + log(numDocs / (df + 1))
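
As a rough comparison of the two idf variants, here is a Python sketch. It
assumes the Lucene formula is read as 1 + ln(numDocs / (df + 1)), with the
natural logarithm. The 6 documents come from this thread; df = 2 is only an
example value, and the function names are mine.

```python
import math


def textbook_idf(num_docs, df):
    """idf as ln(#docs / df), the form used by the other algorithms."""
    return math.log(num_docs / df)


def lucene_style_idf(num_docs, df):
    """idf as 1 + ln(numDocs / (df + 1)), per the Similarity docs (assumed reading)."""
    return 1.0 + math.log(num_docs / (df + 1))


num_docs = 6  # number of documents mentioned in this thread
df = 2        # example document frequency, an assumption

print(round(textbook_idf(num_docs, df), 4))      # 1.0986
print(round(lucene_style_idf(num_docs, df), 4))  # 1.6931
```

Neither of these gives 1 / df, which is what the value reported by Solr
looks like.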


> 
> >
> > If i calculate tf*idf, for term c in the first document, according to
> > other algorithms it would be:
> >
> > tf = 3 / 1 = 0.333
> 
> 3/1 = 3, no?  I don't see where in your docs above you could even get  
> a 3 for the letter c.


Here's the other typo: I again wrote 3 / 1 = 0.33 where it should have
been 1 / 3 = 0.33, of course. The differences I see are:

tf (solr) = #occurrences_of_term_T in document_D
tf (other) = #occurrences_of_term_T in document_D / #terms_document_D

df (solr) = #occurrences_of_term_T in all_documents
df (other) = #occurrences_of_term_T in all_documents

idf (solr) = 1 / df
idf (other) = ln(#documents / df)

tf*idf (solr) = tf / df
tf*idf (other) = tf * idf
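
Put together, the two tf*idf variants can be sketched in Python as follows.
The numbers are for term c: 1 occurrence in a document of 3 terms, 6
documents in total. df = 2 is my assumption, chosen because it reproduces
the 0.5 that Solr reports for term c. The function names are made up.

```python
import math


def tfidf_as_reported_by_solr(occurrences, df):
    """The value Solr appears to report: raw tf divided by df."""
    return occurrences / df


def tfidf_other(occurrences, num_terms, num_docs, df):
    """Length-normalized tf multiplied by ln(#docs / df)."""
    tf = occurrences / num_terms
    idf = math.log(num_docs / df)
    return tf * idf


# Term c: 1 occurrence, 3 terms in the document, 6 documents, df = 2 (assumed).
print(tfidf_as_reported_by_solr(1, 2))       # 0.5
print(round(tfidf_other(1, 3, 6, 2), 4))     # 0.3662
```

The second value differs slightly from a hand calculation that rounds tf to
0.333 before multiplying.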


> 
> > idf = ln(6 / 2) = 1.0986
> > tf*idf = 0.333 * 1.0986 = 0.3658
> >
> 
> I think the formulas you are looking at are doing operations to  
> normalize the values, whereas the Solr/Lucene stuff above is telling  
> you their raw values.  Note, Lucene/Solr does length normalization,  
> etc. too, it just isn't encoded into the TF or DF.  For more on  
> Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html
> 


I see, but why not return the values Lucene actually uses? I did not
reconfigure Solr's schema to use another similarity algorithm, and the
Lucene Similarity docs linked above state that DefaultSimilarity uses
calculations similar to the ones I described.



> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
