lucene-dev mailing list archives

From: Michael McCandless <>
Subject: Re: Performance Improvement for Search using PriorityQueue
Date: Mon, 10 Dec 2007 11:50:53 GMT

I don't, offhand.  Working on the indexing side is so much easier :)

You mentioned your experience with large indices & large result sets  
-- is that something you could draw on?

There have also been discussions lately about finding real search  
logs we could use for exactly this reason, though I don't think  
that's come to a good solution yet.

As a simple test you could break Wikipedia into smallish docs (~4K
each = ~2.1 million docs), build the index, and make up a set of
queries, or randomly pick terms for queries.  Obviously the queries
aren't "real", but it's at least a step closer... progress, not
perfection.

Or, if you have access to TREC...


Shai Erera wrote:

> Do you have a dataset and queries I can test on?
> On Dec 10, 2007 1:16 PM, Michael McCandless <> wrote:
>> Shai Erera wrote:
>>> No - I didn't try to populate an index with real data and run
>>> real queries (what is "real" after all?).  I know from my
>>> experience of indexes with several million documents that there
>>> are queries with several hundred thousand results (one query even
>>> hit 2.5 M documents).  This is typical in search: users type on
>>> average 2.3 terms in a query.  The chances you'd hit a query with
>>> a huge result set are not that small in such cases (I'm not
>>> saying this is the most common case though; I agree that most
>>> searches don't process that many documents).
>> Agreed: many queries do hit a great many results.  But I agree
>> with Paul: it's not clear how this "typically" translates into how
>> many ScoreDocs get created.
>>> However, this change will improve performance from the
>>> algorithmic point of view - you allocate at most
>>> numRequestedHits+1 ScoreDocs no matter how many documents your
>>> query processes.
>> It's definitely a good step forward: not creating extra garbage in
>> hot spots is worthwhile, so I think we should make this change.
>> Still, I'm wondering how much it helps in practice.
>>
>> I think benchmarking on "real" use cases (vs synthetic tests) is
>> worthwhile: it keeps you focused on what really counts, in the end.
>> In this particular case there are at least 2 things it could show
>> us:
>>
>>   * How many ScoreDocs really get created, or, what percentage of
>>     hits actually result in an insertion into the PQ?
>>
>>   * How much is this savings as a percentage of the overall time
>>     spent searching?
>>
>> Mike
> -- 
> Regards,
> Shai Erera
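
For reference, here is a minimal sketch (plain Java, not Lucene's
actual HitQueue) of the allocation pattern Shai describes above: at
most numRequestedHits+1 ScoreDocs ever exist, because the entry
popped off a full queue is recycled for the next competitive hit.
The hitCount/insertCount fields are instrumentation added only to
answer the first of the two measurement questions:

import java.util.Comparator;
import java.util.PriorityQueue;

public class TopHitsSketch {

  static class ScoreDoc {
    int doc;
    float score;
  }

  private final PriorityQueue<ScoreDoc> pq;   // min-heap on score
  private final int size;
  private ScoreDoc spare = new ScoreDoc();    // the "+1" reusable slot
  long hitCount, insertCount;                 // instrumentation only

  TopHitsSketch(int numRequestedHits) {
    size = numRequestedHits;
    pq = new PriorityQueue<ScoreDoc>(numRequestedHits,
        new Comparator<ScoreDoc>() {
          public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(a.score, b.score);
          }
        });
  }

  void collect(int doc, float score) {
    hitCount++;
    if (pq.size() < size) {
      // Queue not yet full: one of the numRequestedHits allocations.
      ScoreDoc sd = new ScoreDoc();
      sd.doc = doc;
      sd.score = score;
      pq.add(sd);
      insertCount++;
    } else if (score > pq.peek().score) {
      // Competitive hit: reuse the spare, then recycle the evicted
      // minimum as the next spare.  No new allocation.
      spare.doc = doc;
      spare.score = score;
      pq.add(spare);
      spare = pq.poll();
      insertCount++;
    }
    // Non-competitive hit: nothing is allocated at all.
  }
}

After feeding a query's hits through collect(), insertCount /
(double) hitCount is exactly the "what percentage of hits actually
result in an insertion into the PQ" number.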

