lucene-dev mailing list archives

From "Rob Audenaerde (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Thu, 06 Mar 2014 10:48:54 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922293#comment-13922293 ]

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

Thanks Shai,

I have fixed the points you noted about the collector. I renamed sampleThreshold to sampleSize.
The collector now picks a samplingRatio that reduces the number of hits to sampleSize
whenever the number of hits exceeds it.
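
As a rough illustration of that rule (a sketch only, not the code in the patch; the class and method names are hypothetical), the ratio could be derived from the hit count like this:

```java
final class SamplingRatioSketch {

    /**
     * Returns the fraction of hits to keep so that roughly sampleSize hits remain.
     * When there are no more hits than sampleSize, nothing is sampled away (ratio 1.0).
     */
    static double samplingRatio(int totalHits, int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        if (totalHits <= sampleSize) {
            return 1.0; // few enough hits: keep them all
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        // 1,000,000 hits with a sample size of 100,000: keep one hit in ten
        System.out.println(samplingRatio(1_000_000, 100_000));
        // fewer hits than the sample size: no sampling
        System.out.println(samplingRatio(5_000, 100_000));
    }
}
```

Keeping the no-sampling case as an explicit ratio of 1.0 means callers do not need a separate code path for small result sets.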

I have a general question about your remarks on the test, besides fixing the obvious (names,
commit, sops). Is there a reason to add more randomness to one test? I normally try to test
one aspect in a unit test. If I also want to test some other aspect, like random document
counts (to test the sampling ratio, for example), I add more tests.
  
{quote}
   Make the two collector instances take 100/10% of the numDocs when you fix it
 {quote}
Sorry, I don't get what you mean by this.
{quote}
    I don't understand how you know that numChildren=5 when you ask for the 10 top children.
Isn't it possible that w/ some random seed the number of children will be different?
        In fact, I think that the random collectors should be initialized w/ a random seed
that depends on the test? Currently they aren't and so always use 0xdeadbeef?
{quote}
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents are hits and the value is i % 10.
There is a really small chance that one of the five values will be missed entirely when sampling.
But that is {{0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193}},
so that is probably not going to happen :).
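
The estimate can be checked in log space. This is only a sketch of the arithmetic: it reads 0.8 as the chance that a single sampled hit does not carry a given value and 2000 as the number of sampled hits, which is an interpretation of the figures above. The class and method names are hypothetical.

```java
final class MissedValueBound {

    /**
     * log10 of the union bound values * pMissPerHit^sampledHits,
     * i.e. an upper bound on the chance that at least one value is never sampled.
     * Working in logarithms keeps the tiny probability readable as an exponent.
     */
    static double log10UnionBound(int values, double pMissPerHit, int sampledHits) {
        return Math.log10(values) + sampledHits * Math.log10(pMissPerHit);
    }

    public static void main(String[] args) {
        double exponent = log10UnionBound(5, 0.8, 2000);
        // The exponent comes out slightly below -193, in line with the ~10^-193 estimate.
        System.out.println("upper bound is about 10^" + exponent);
    }
}
```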



> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

