lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nadav Har'El" <>
Subject Re: Performance Improvement for Search using PriorityQueue
Date Tue, 11 Dec 2007 16:54:34 GMT
On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for Search using PriorityQueue":
> Hi
> Lucene's PQ implements two methods: put (assumes the PQ has room for the
> object) and insert (checks whether the object can be inserted etc.). The
> implementation of insert() requires the application that uses it to allocate
> a new object every time it calls insert. Specifically, it cannot reuse the
> objects that were removed from the PQ.

I've read this entire thread, and would like to add my comments about three
independent issues, which I think that can and perhaps should be considered

1. When Shai wanted to add the insertWithOverflow() method to PriorityQueue
   he couldn't just subclass PriorityQueue in his application, but rather
   was forced to modify PriorityQueue itself. Why? just because one field
   of that classi - "heap" - was defined "private" instead of "protected".

   Is there a special reason for that? If not, can we make the trivial change
   to make PriorityQueue's fields protected, to allow Shai and others (see the
   next point) to add functionality on top of PriorityQueue?

2. PriorityQueue, in addition to being used in about a dozen places inside
   Lucene, is also a public class that advanced users often find useful when
   implementing features like new collectors, new queries, and so on.
   Unfortunately, my experience exactly matches Shai's: In the two occasions
   where I used a PriorityQueue, I found that I needed such a 
   insertWithOverflow() method. If this feature is so useful (I can't believe
   that Shai and me are the only ones who found it useful), I think it would
   be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used
   inside Lucene.

   Just to make it sound more interesting, let me give you an example where
   I needed (and implemented) an "insertWithOverflow()": I was implementing a
   faceted search capability over Lucene. It calculated a count for each
   facet value, and then I used a PriorityQueue to find the 10 best values.
   The problem is that I also needed an "other" aggregator, which was supposed
   to aggregate (in various ways) all the facets except the 10 best ones. For
   that, I needed to know which facets dropped off the priorityqueue.

3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow()
   to be used in TopDocCollector. I have to admit I don't know how much
   of a benefit this will be in the "typical" case. But I do know that
   there's no such thing as a "typical" case...
   I can easily think of a quite typical "worst case" though: Consider a
   collection indexed in order of document age (a pretty typical scenario
   for a long-running index), and then you do a sorting query
   (TopFieldDocCollector), asking it to bring the 10 newest documents.
   In that case, each and every document will have a new DocScore created -
   which is the worst-case that Shai feared.
   It would be nice if Shai or someone else could provide a measurement in
   that case.

P.S. When looking now at PriorityQueue's code, I found two tiny
performance improvements that could be easily made to it - I wonder if
there's any reason not to do them:

 1. Insert can use heap[1] directly instead of calling top(). After all,
    this is done in an if() that already ensures that size>0.

 2. Regardless, top() could return heap[1] always, without any if(). After
    all, the heap array is initialized to all nulls, so when size==0, heap[1]
    is null anyway.

> PriorityQueue change (added insertWithOverflow method)
> -----------------------------------------------------------------------------------
>     /**
>      * insertWithOverflow() is similar to insert(), except its return value:
> it
>      * returns the object (if any) that was dropped off the heap because it
> was
>      * full. This can be the given parameter (in case it is smaller than the
>      * full heap's minimum, and couldn't be added) or another object that
> was
>      * previously the smallest value in the heap and now has been replaced
> by a
>      * larger one.
>      */
>     public Object insertWithOverflow(Object element) {
>         if (size < maxSize) {
>             put(element);
>             return null;
>         } else if (size > 0 && !lessThan(element, top())) {
>             Object ret = heap[1];
>             heap[1] = element;
>             adjustTop();
>             return ret;
>         } else {
>             return element;
>         }
>     }
> [Very similar to insert(), only it returns the object that was kicked out of
> the Queue, or null]

Nadav Har'El                        |       Tuesday, Dec 11 2007, 3 Tevet 5768
IBM Haifa Research Lab              |-----------------------------------------
                                    |A professor is one who talks in someone           |else's sleep.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message