lucene-dev mailing list archives

From "Gilad Barkai (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Sat, 01 Mar 2014 19:00:23 GMT


Gilad Barkai commented on LUCENE-5476:

Great effort!

I wish to throw in another point - the description of this issue is about sampling, but the
implementation is about *random* sampling.
Sampling does not have to be random, nor is random sampling very fast (indeed, calling
Random.nextInt 1M times would be measurable by itself IMHO).
A different sampler could be:
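int acceptedModulo = (int) (1 / sampleRatio);

int next() {
  do {
    // The right-hand side was lost in the archive; inner.nextDoc() is an
    // assumed call on the wrapped iterator.
    nextDoc = inner.nextDoc();
  } while (nextDoc != NO_MORE_DOCS && nextDoc % acceptedModulo != 0);

  return nextDoc;
}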
This should be faster as a sampler, and perhaps save us from creating a new DocIdSet.
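
To make the modulo idea concrete, here is a minimal self-contained sketch. It does not use Lucene classes: ModuloSampler, its int[] input and the sample() helper are hypothetical stand-ins for a wrapped doc-id iterator, and NO_MORE_DOCS is redefined locally as a sentinel.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a sampling doc-id iterator; not a Lucene class. */
final class ModuloSampler {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // local sentinel

    private final int[] docs;          // matching doc ids, in increasing order
    private final int acceptedModulo;  // keep doc ids divisible by this value
    private int pos = -1;

    ModuloSampler(int[] sortedDocs, double sampleRatio) {
        if (sampleRatio <= 0 || sampleRatio > 1) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        this.docs = sortedDocs;
        this.acceptedModulo = Math.max(1, (int) (1 / sampleRatio));
    }

    /** Advances to the next accepted doc id, or NO_MORE_DOCS when exhausted. */
    int next() {
        int doc;
        do {
            pos++;
            doc = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        } while (doc != NO_MORE_DOCS && doc % acceptedModulo != 0);
        return doc;
    }

    /** Convenience helper: collects every sampled doc id into a list. */
    static List<Integer> sample(int[] sortedDocs, double sampleRatio) {
        ModuloSampler sampler = new ModuloSampler(sortedDocs, sampleRatio);
        List<Integer> out = new ArrayList<>();
        for (int doc = sampler.next(); doc != NO_MORE_DOCS; doc = sampler.next()) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] hits = new int[100];
        for (int i = 0; i < hits.length; i++) {
            hits[i] = i;
        }
        System.out.println(sample(hits, 0.1));
    }
}
```

Note that this filters on the doc id itself, not on the position within the hit set, so the sample size only matches the ratio when the matching doc ids are spread fairly evenly.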

One last thing - if I did the math right - the sample produced by the code in the patch would
be twice as large as the user may expect.
For a sample ratio of 0.1, random.nextInt() would be called with 10, so the average "jump"
is actually about 5 - and every 5th document in the original set (again, on average) would be
selected, not every 10th. I think random.nextInt should be called with twice the bound
it is called with now (e.g. 20, making the average jump about 10).
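
The arithmetic can be checked quickly: random.nextInt(n) is uniform over 0..n-1, so its mean is (n-1)/2, i.e. 4.5 for n=10 and 9.5 for n=20. The sketch below only estimates that mean empirically; JumpEstimate and averageJump are hypothetical names, and it assumes the patch uses the value returned by nextInt directly as the jump, which is not shown in this message.

```java
import java.util.Random;

/** Hypothetical helper that estimates the mean of Random.nextInt(bound). */
final class JumpEstimate {

    /** Returns the average of `trials` draws of random.nextInt(bound). */
    static double averageJump(int bound, int trials, long seed) {
        Random random = new Random(seed);
        long sum = 0;
        for (int i = 0; i < trials; i++) {
            sum += random.nextInt(bound); // uniform over 0..bound-1
        }
        return (double) sum / trials;
    }

    public static void main(String[] args) {
        // Expected to land close to 4.5 and 9.5 respectively.
        System.out.println(averageJump(10, 1_000_000, 42L));
        System.out.println(averageJump(20, 1_000_000, 42L));
    }
}
```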

> Facet sampling
> --------------
>                 Key: LUCENE-5476
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch,
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets
> is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?
