lucene-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Deeper Ranking Issues in Lucene
Date Mon, 02 Apr 2007 20:24:06 GMT

I'm not sure anyone can fully address all your questions, but I  
thought I would point you at 
scoring.html if you haven't already started there.  It has some  
details on scoring, as well as pointers into lower level details.

Some comments are also inline below, which are just a few thoughts on  
some of your questions, but nothing in-depth like I think you are  
looking for.  I am not able to answer in more detail at this time,  
but can give you some pointers on where to look.

Hope it helps.


On Apr 2, 2007, at 3:35 PM, wrote:

> I’ve been working on ranking/scoring issues for full text search for
> years. When I try to implement different ranking inside the Lucene  
> engine,
> I face some problems. I list some of my experience and questions here.
> The major task I want to implement inside Lucene is different
> ranking/scoring algorithms. I may not find the correct source of
> information, but I really cannot find a detailed document  
> describing the
> relations among ranking/scoring related classes in Lucene on web.  
> “Lucene
> in Action” concerns mostly about applications level usage but not  
> these
> lower-level APIs.
> (1) The first thing I tried is the abstract class Scorer. The  
> description
> for this class is:
> /** Expert: Implements scoring for a class of queries. */
> If you looked IndexSearcher class, one possible search process is
> implemented inside
> public TopDocs search(Query query, Filter filter, final int nDocs).  
> If you
> look in detail how it is implemented, you will find out it first  
> acquires
> such a scorer instance by:
> Scorer scorer = query.weight(this).scorer(reader);
> The magic happens inside the method call scorer.score(new  
> HitCollector());
> What the score method does is something similar to an Iterator. The
> scorer.score continuously call to acquire next qualified
> document (I guess the qualified documents inside this iterator is from
> BitSet operations from the given query. I did not looked into the  
> detail
> implementation of that), and calls scorer.score() to get the score  
> for the
> current document that the iterator pointed at.
> What the HitCollector does is merely simple. It only implement a  
> method
> called collect(int doc, float score). This method is called every time
> when a new document’s score are calculated. In the  
> method, the documents and their scores are sent to a PriorityQueue and
> ranked according to the scores.
> (2) My first plan is to modify the scorer.score()method.  
> Unfortunately, I
> found this is extremely complex. score() is an abstract method  
> which is
> implemented inside its subclasses like Boolean Scorer, Conjunction  
> Scorer,
> Phrase Score, … Since I do not need to consider the complex Boolean  
> query
> syntax (in my experiments, query is defined as a list of terms  
> connected
> by disjunctions), I implement the score method inside Scorer rather  
> inside
> sub classes.
> What I did is every time when Scorer.score are called, I pass the  
> current
> document number via doc(), and read out the term frequencies from
> IndexReader. getTermFreqVector(int docNumber, String field)  method. I
> found this operation is super slow. The major cost is spent on the  
> method
> getTermFreqVector.

Term Vectors are stored differently from TermDocs, so I am not  
surprised that they are slower.  There probably could be some  
optimizations made to their storage, but I would guess that most  
people don't use Term Vecs all that much, so no one has looked too  
closely at where optimizations could be made.  They almost certainly  
are not using them to loop over all the docs in the system.  Most  
scenarios, I believe are for highlighting results on a doc-by-doc  
basis or for relevance feedback/more like this.  Think of Term  
Vectors as an after the fact addition to Lucene for convenience  
purposes, not for scoring/performance.

> (3) Later I notice in the implementation in TermScorer, there is a
> function call
>, freqs);    // refill buffer
> And I read the comments for this function in IndexReader, it says
> /** Attempts to read multiple entries from the enumeration, up to  
> length of
>    * <i>docs</i>.  Document numbers are stored in <i>docs</i>,
> term
>    * frequencies are stored in <i>freqs</i>.  The <i>freqs</i>
> array must
> be as
>    * long as the <i>docs</i> array.
>    *
>    * <p>Returns the number of entries read.  Zero is only returned  
> when the
>    * stream has been exhausted.  */
> So different from IndexReader.getTermFreqVector, which read out  
> term-freq
> vectors for a document, this function read doc-freq vectors for a  
> term. I
> find this method call is extremely faster. At least two magnitudes  
> faster
> than getTermFreqVector, if I want to get all term-freqs for given  
> terms
> for all docs. I do not know the reason why there is such difference, I
> cannot find a document describing the implementation difference and
> purpose among these two.
> (4) About extending Lucene’s scoring function.
> If we want to implement any arbitrary ranking/scoring for term- 
> frequency
> based algorithms, we need the scoring fits the following framework
> Any term-frequency scoring can be expressed in the following ways  
> (let’s
> simplify that there is only one field so that we can ignore boost  
> factor
> at this time):
> Sigma (t in q) [TermWeight(t in d)*TermWeight(t in q) / lengthNorm 
> (d) /
> lengthNorm(q)]
> Since lengthNorm(q) is the same for each document, it will not affect
> ranking, we just ignore it. We further separate term weight into three
> parts, term weight related to document, term weight related to  
> query, term
> weight related to corpus (not related to document or query) and  
> document
> length norm.
> Sigma (t in q) [TW(t in d)*TW(t in q) * TW (t) / lengthNorm(d)]
> We can notice this matches Lucene’ ranking function
> Sigma (t in q) [tf(t in d)*1*idf (t) / lengthNorm(d)]. To save
> computational cost, lengthNorm(d) is pre-calculated when indexing the
> corpus. So Lucene’ lengthNorm(d) does not involve any corpus  
> statistics
> into calculation, it is merely the # of terms inside this document.  
> On the
> other hand, it treats the term weight in query is the same as 1.  
> That is
> Lucene does not differentiate terms important inside query. If we  
> want to
> emphasis a term twice we need to put two terms inside the query. For
> example, instead of search “boat” we put “boat boat”, do I understand
> correctly?
> So, generally, if we can update the lengthNorm(d) inside the  
> indexing code
> or via post-processing after all documents are indexed, Lucene can
> implement any arbitrary ranking function such as BM25. A pity is  
> that it
> does not directly support query term weight in the query language.

We like patches and will at least consider and discuss most any patch  
that is well thought out, backward compatible and tested and where  
the author makes a good case for it.

> (5) My final big question would be how Lucene really implement
> ranking/scoring. We could notice there are two possible strategies.  
> Each
> of them will result in different flexibility if we need modify current
> ranking algorithms.
> The first strategy is Lucene generate document scores in a document by
> document manner. In Scorer.score, we notice the framework is each  
> time the
> scorer meet a new document, Lucene will generate a score for this
> document. This framework is simple and intuitive. And all we  
> discussed in
> (4) will fit this framework. When you processing the current document,
> lengthNorm(d) can be read, even if TW(t in d) has relation to
> lengthNorm(d) we can calculate that accordingly. Unfortunately, I  
> cannot
> find any low-level code could be thinked related to this  
> implementation
> strategy. This makes me think the Scorer.score method is not the real
> place Lucene implement its ranking. I am pretty confused about this  
> part.
> Who can help me with this?

You might take a look at how Query/Weight/Scorers are implemented.   
For instance, I just added a BoostingTermQuery to Java version that  
implements this trifecta.

> The second strategy is Lucene generate document scores in a term by  
> term
> manner. For each term inside the query, Lucene calls 
> (docs,
> freqs) to accumulate scores over the specific term dimension over  
> all the
> documents.  Under this framework (which I feel is current Lucene’s
> implementation), what we discussed in (4) is not held. We need extend
> current Lucene’s tf(float freq) function to tf(float freq, float  
> norm), so
> that arbitrary ranking could be implemented.

There is some discussion of flexible indexing approaches on Lucene  
Java, which also, to me anyway, implies flexible scoring.  You might  
search that archive for "flexible indexing" to see if anything  
strikes you.

> (6) I have implemented a search module outside Lucene IndexSearcher  
> which
> could implement any arbitrary ranking over vector query (a list of
> disjunctive terms). The term-by-term implementation is much faster  
> than my
> previous document-by-document implementation. But I currently still  
> cannot
> encode the ranking module under Lucene’ Boolean engines, that is  
> only rank
> the documents retrieved (only rank the documents satisfy the condition
> specified by the query).
> __________________________________________________________________
> I do not know whether I described my points clear, it is complex  
> and hard
> to write in a short plain text message. Maybe I should post a PDF  
> version
> tech-report online so that the problem is stated clearer. I am not  
> sure my
> understanding is correct. Thanks for any help and comments.
> Xiangyu Jin
> Department of Computer Science
> University of Virginia
> Charlottesville, VA 22903

Grant Ingersoll
Center for Natural Language Processing

Read the Lucene Java FAQ at 

View raw message