lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Tue, 04 Mar 2014 16:11:23 GMT


Shai Erera commented on LUCENE-5476:

That's a good point, Gilad. I think once this gets into Lucene, other people will use
it, and we should offer a good sampling collector that works in more than one extreme case
(i.e. not only when there are always tons of results), even if it's well documented. One of
the problems is that when you have a query Q, you don't know in advance how many documents
it's going to match.

That's where the min/maxDocsToEvaluate came in handy in the previous solution -- it made SamplingFC
smart and adaptive. If the query matched very few documents, not only did it not bother to
sample (saving CPU), it also didn't come up w/ a crappy sample (as Gilad says, 10 docs). The
previous sampling worked on the entire query; the new collector could use these thresholds as well.

But I feel that this has to give a qualitative solution -- the sample has to be meaningful in
order to be considered representative at all, and we should let the app specify what "meaningful"
means to it, in the form of minDocsToEvaluate(PerSegment).

And since sampling is about improving speed, we should also let the app specify a maxDocsToEvaluate(PerSegment),
so that a 1% sample doesn't end up evaluating millions of documents.
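To make the idea concrete, here's a rough standalone sketch of the adaptive behavior I mean. The class and method names are hypothetical (this is not the actual SamplingFC/SFC code, and it doesn't touch any Lucene APIs); it just shows how minDocsToEvaluate and maxDocsToEvaluate would bound the sample:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the adaptive sampling discussed above:
// - if the query matched too few docs, skip sampling entirely (keep them all),
// - otherwise sample at the given rate, but stop once maxDocsToEvaluate
//   documents have been collected, so huge result sets stay cheap.
class AdaptiveSampler {
    private final double sampleRate;
    private final int minDocsToEvaluate; // below this, sampling is pointless
    private final int maxDocsToEvaluate; // above this, cap the CPU cost
    private final Random random;

    AdaptiveSampler(double sampleRate, int minDocsToEvaluate,
                    int maxDocsToEvaluate, long seed) {
        this.sampleRate = sampleRate;
        this.minDocsToEvaluate = minDocsToEvaluate;
        this.maxDocsToEvaluate = maxDocsToEvaluate;
        this.random = new Random(seed);
    }

    /** Returns the doc ids that facet counting should actually evaluate. */
    List<Integer> sample(List<Integer> matchedDocs) {
        // Too few hits: a 1% sample of e.g. 10 docs is meaningless, keep all.
        if (matchedDocs.size() <= minDocsToEvaluate) {
            return matchedDocs;
        }
        List<Integer> sampled = new ArrayList<>();
        for (int doc : matchedDocs) {
            if (sampled.size() >= maxDocsToEvaluate) {
                break; // cap evaluation cost even for millions of hits
            }
            if (random.nextDouble() < sampleRate) {
                sampled.add(doc);
            }
        }
        return sampled;
    }
}
```

The real collector would work per segment with doc-id iterators rather than materialized lists, but the two thresholds would play the same roles.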

Robert, I agree w/ your comment on XORShiftRandom - it was a mistake to suggest moving it
under core.

Rob, I feel like I've thrown you back and forth with the patch. If you want, I can take a
stab at making the changes to SFC.

> Facet sampling
> --------------
>                 Key: LUCENE-5476
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents), counting facets
> is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?

This message was sent by Atlassian JIRA
