lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Performance Improvement for Search using PriorityQueue
Date Mon, 10 Dec 2007 13:52:50 GMT

OK, sounds like a plan, thanks!

Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the  
Wikipedia XML export file.

Mike

On Dec 10, 2007, at 7:03 AM, Shai Erera wrote:

> I have access to TREC. I can try this.
> W.r.t. the large indexes - I don't have access to the data, just
> scenarios our customers ran into in the past.
> Does the benchmark package include code to crawl Wikipedia? If not,
> do you have such code? I don't want to write it from scratch for this
> specific task.
>
> On Dec 10, 2007 1:50 PM, Michael McCandless <lucene@mikemccandless.com>
> wrote:
>
>>
>> I don't offhand.  Working on the indexing side is so much easier :)
>>
>> You mentioned your experience with large indices & large result sets
>> -- is that something you could draw on?
>>
>> There have also been discussions lately about finding real search
>> logs we could use for exactly this reason, though I don't think
>> that's come to a good solution yet.
>>
>> As a simple test you could break Wikipedia into smallish docs (~4K
>> each = ~2.1 million docs), build the index, and make up a set of
>> queries, or randomly pick terms for queries?  Obviously the queries
>> aren't "real", but it's at least a step closer.... progress not
>> perfection.
>>
>> Or, if you have access to TREC...
>>
>> Mike
>>
>> Shai Erera wrote:
>>
>>> Do you have a dataset and queries I can test on?
>>>
>>> On Dec 10, 2007 1:16 PM, Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>
>>>> Shai Erera wrote:
>>>>
>>>>> No - I didn't try to populate an index with real data and run real
>>>>> queries (what is "real" after all?). I know from my experience of
>>>>> indexes with several million documents where there are queries with
>>>>> several hundred thousand results (one query even hit 2.5 M
>>>>> documents). This is typical in search: users type on average 2.3
>>>>> terms in a query. The chances you'd hit a query with a huge result
>>>>> set are not that small in such cases (I'm not saying this is the
>>>>> most common case though, I agree that most searches don't process
>>>>> that many documents).
>>>>
>>>> Agreed: many queries do hit a great many results.  But I agree with
>>>> Paul: it's not clear how this "typically" translates into how many
>>>> ScoreDocs get created?
>>>>
>>>>> However, this change will improve performance from an algorithmic
>>>>> point of view - you allocate at most numRequestedHits+1 ScoreDocs
>>>>> no matter how many documents your query processes.
>>>>
>>>> It's definitely a good step forward: not creating extra garbage in
>>>> hot spots is worthwhile, so I think we should make this change.
>>>> Still I'm wondering how much this helps in practice.
>>>>
>>>> I think benchmarking on "real" use cases (vs synthetic tests) is
>>>> worthwhile: it keeps you focused on what really counts, in the end.
>>>>
>>>> In this particular case there are at least 2 things it could show us:
>>>>
>>>>   * How many ScoreDocs really get created, or, what percentage of
>>>>     hits actually result in an insertion into the PQ?
>>>>
>>>>   * How large is this saving as a percentage of the overall time
>>>>     spent searching?
>>>>
>>>> Mike
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Shai Erera
>>
>>
>
>
> -- 
> Regards,
>
> Shai Erera
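
For reference, the allocation argument quoted above (a bounded number of
ScoreDoc-like objects however many documents the query matches) and the
first of the two measurements (what percentage of hits is actually
inserted into the PQ) can be sketched in plain Java. This is a minimal
illustration, not Lucene's code: TopHitsSketch, Hit, collect and the
counters are hypothetical names, and java.util.PriorityQueue stands in
for Lucene's own queue. Because it peeks at the weakest retained hit
before replacing it, this sketch needs only numRequestedHits objects
rather than the numRequestedHits+1 mentioned in the thread.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: bounded top-N collection that reuses evicted entries. */
class TopHitsSketch {

    /** Stand-in for a (doc, score) pair; NOT Lucene's ScoreDoc class. */
    static final class Hit {
        int doc;
        float score;
    }

    final int numRequestedHits;
    final PriorityQueue<Hit> queue; // min-heap: weakest retained hit on top

    long hitsSeen;    // every hit passed to collect()
    long insertions;  // hits that actually entered the queue
    long allocations; // Hit objects created (never exceeds numRequestedHits)

    TopHitsSketch(int numRequestedHits) {
        if (numRequestedHits < 1) {
            throw new IllegalArgumentException("numRequestedHits must be >= 1");
        }
        this.numRequestedHits = numRequestedHits;
        this.queue = new PriorityQueue<>(
                numRequestedHits, (a, b) -> Float.compare(a.score, b.score));
    }

    void collect(int doc, float score) {
        hitsSeen++;
        Hit slot;
        if (queue.size() < numRequestedHits) {
            // Queue not full yet: this is the only place an object is allocated.
            slot = new Hit();
            allocations++;
        } else if (score > queue.peek().score) {
            // Competitive hit: evict the weakest entry and reuse its object.
            slot = queue.poll();
        } else {
            // Not competitive: no insertion and no garbage.
            return;
        }
        slot.doc = doc;
        slot.score = score;
        queue.add(slot);
        insertions++;
    }

    /** Percentage of collected hits that resulted in a queue insertion. */
    double insertionPercentage() {
        return hitsSeen == 0 ? 0.0 : 100.0 * insertions / hitsSeen;
    }
}
```

Feeding such a collector the doc IDs and scores seen during a real query
run and reading insertionPercentage() afterwards would give a rough
answer to the first question; the second still needs a profiler or
timing around the search call itself.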
