lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Re: Performance Improvement for Search using PriorityQueue
Date Mon, 10 Dec 2007 12:03:52 GMT
I have access to TREC. I can try this.
W.r.t. the large indexes - I don't have access to the data, just scenarios
our customers ran into in the past.
Does the benchmark package include code to crawl Wikipedia? If not, do you
have such code? I don't want to write it from scratch for this specific
task.
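For reference, here is roughly the kind of bounded-allocation collector I have
in mind (an illustrative sketch only - the class names and the use of
java.util.PriorityQueue are mine, not the actual Lucene classes):

```java
import java.util.PriorityQueue;

// Illustrative sketch, not the actual Lucene API: collect the top-k hits
// while allocating at most k ScoreDoc objects, reusing the evicted entry
// instead of creating a new one for every competitive hit.
class ScoreDoc {
    int doc;
    float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

class TopKCollector {
    private final PriorityQueue<ScoreDoc> pq;  // min-heap: weakest hit on top
    private final int k;

    TopKCollector(int k) {
        this.k = k;
        this.pq = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    }

    void collect(int doc, float score) {
        if (pq.size() < k) {
            pq.add(new ScoreDoc(doc, score));  // only the first k hits allocate
        } else if (score > pq.peek().score) {
            ScoreDoc bottom = pq.poll();       // evict the weakest hit...
            bottom.doc = doc;                  // ...and recycle its object
            bottom.score = score;
            pq.add(bottom);
        }
        // non-competitive hits allocate nothing at all
    }

    int size() { return pq.size(); }
    float worstScore() { return pq.peek().score; }
}
```

So even a query matching millions of documents creates at most
numRequestedHits ScoreDoc objects - which, if I understand it correctly, is
the point of the change.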

On Dec 10, 2007 1:50 PM, Michael McCandless <lucene@mikemccandless.com>
wrote:

>
> I don't offhand.  Working on the indexing side is so much easier :)
>
> You mentioned your experience with large indices & large result sets
> -- is that something you could draw on?
>
> There have also been discussions lately about finding real search
> logs we could use for exactly this reason, though I don't think
> that's come to a good solution yet.
>
> As a simple test you could break Wikipedia into smallish docs (~4K
> each = ~2.1 million docs), build the index, and make up a set of
> queries, or randomly pick terms for queries?  Obviously the queries
> aren't "real", but it's at least a step closer.... progress not
> perfection.
>
> Or, if you have access to TREC...
>
> Mike
>
> Shai Erera wrote:
>
> > Do you have a dataset and queries I can test on?
> >
> > On Dec 10, 2007 1:16 PM, Michael McCandless
> > <lucene@mikemccandless.com>
> > wrote:
> >
> >> Shai Erera wrote:
> >>
> >>> No - I didn't try to populate an index with real data and run real
> >>> queries (what is "real" after all?). I know from my experience of
> >>> indexes with several million documents that there are queries with
> >>> several hundred thousand results (one query even hit 2.5 M
> >>> documents). This is typical in search: users type on average 2.3
> >>> terms per query, so the chances of hitting a query with a huge
> >>> result set are not that small (I'm not saying this is the most
> >>> common case, though; I agree that most searches don't process that
> >>> many documents).
> >>
> >> Agreed: many queries do hit a great many results.  But I agree with
> >> Paul:
> >> it's not clear how this "typically" translates into how many
> >> ScoreDocs
> >> get created?
> >>
> >>> However, this change will improve performance from the algorithmic
> >>> point of view - you allocate at most numRequestedHits+1 ScoreDocs,
> >>> no matter how many documents your query processes.
> >>
> >> It's definitely a good step forward: not creating extra garbage in
> >> hot
> >> spots is worthwhile, so I think we should make this change.  Still
> >> I'm
> >> wondering how much this helps in practice.
> >>
> >> I think benchmarking on "real" use cases (vs synthetic tests) is
> >> worthwhile: it keeps you focused on what really counts, in the end.
> >>
> >> In this particular case there are at least 2 things it could show us:
> >>
> >>   * How many ScoreDocs really get created, or, what %tg of hits
> >>     actually result in an insertion into the PQ?
> >>
> >>   * How much is this savings as a %tg of the overall time spent
> >>     searching?
> >>
> >> Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > Regards,
> >
> > Shai Erera
>
>


-- 
Regards,

Shai Erera
