lucene-dev mailing list archives

From "Rob Audenaerde (JIRA)" <>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Thu, 06 Mar 2014 10:48:54 GMT


Rob Audenaerde commented on LUCENE-5476:

Thanks Shai,

I have fixed the points you noted about the collector. I renamed sampleThreshold to sampleSize.
The collector now picks a samplingRatio that reduces the number of hits to sampleSize
whenever the number of hits is greater than sampleSize.
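
To illustrate the ratio selection described above, here is a minimal sketch. It assumes the ratio is simply sampleSize divided by the number of hits; the class and method names are hypothetical and are not taken from the patch.

```java
// Hypothetical sketch, not the code from LUCENE-5476.patch.
class SamplingRatioSketch {

    /**
     * Returns the fraction of hits to keep so that roughly sampleSize hits
     * remain. If there are no more hits than sampleSize, nothing is sampled
     * and the ratio is 1.0.
     */
    static double samplingRatio(int totalHits, int sampleSize) {
        if (totalHits <= sampleSize) {
            return 1.0; // small result set: keep every hit
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        // 1M hits with a sample size of 100k: keep about 10% of the hits.
        System.out.println(samplingRatio(1_000_000, 100_000));
        // Fewer hits than the sample size: keep everything.
        System.out.println(samplingRatio(500, 100_000));
    }
}
```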

I have a general question about your remarks on the test, besides fixing the obvious (names,
commit, sops). Is there a reason to add more randomness to one test? I normally try to test
one aspect per unit test. If I also want to test some other aspect, like random document
counts (to test the sampling ratio, for example), I add more tests.

> Make the two collector instances take 100/10% of the numDocs when you fix it

Sorry, I don't get what you mean by this.

> I don't understand how you know that numChildren=5 when you ask for the 10 top children.
> Isn't it possible that w/ some random seed the number of children will be different?
> In fact, I think that the random collectors should be initialized w/ a random seed
> that depends on the test? Currently they aren't and so always use 0xdeadbeef?

There will be 5 facet values (0, 2, 4, 6 and 8), as the facet value is i % 10 and only the even
documents are hits. There is a really small chance that one of the five values will be missed
entirely when sampling. But that is {{0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193}},
so that is probably not going to happen :).
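
The estimate above can be checked with a short sketch. It reads the figure as a union bound over the five values and uses the numbers from this message (0.8 as the chance of not taking a document, 2000 documents per value); the class and method names are hypothetical. The computation is done in log10 space so that it stays robust for even smaller probabilities.

```java
// Hypothetical sketch for checking the back-of-the-envelope estimate.
class MissProbabilitySketch {

    /**
     * Returns log10 of the bound pMiss^docsPerValue * numValues, i.e. the
     * chance that at least one facet value has none of its documents sampled.
     */
    static double log10MissBound(double pMiss, int docsPerValue, int numValues) {
        return docsPerValue * Math.log10(pMiss) + Math.log10(numValues);
    }

    public static void main(String[] args) {
        double exponent = log10MissBound(0.8, 2000, 5);
        // The printed exponent lies between -194 and -193.
        System.out.println("bound ~ 10^" + exponent);
    }
}
```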

> Facet sampling
> --------------
>                 Key: LUCENE-5476
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
>                      LUCENE-5476.patch, LUCENE-5476.patch
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents), counting facets
> is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?

This message was sent by Atlassian JIRA
