lucene-dev mailing list archives

From "Rob Audenaerde (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Fri, 07 Mar 2014 16:40:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924037#comment-13924037
] 

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

{quote}
...Given our test framework, randomness is not a big deal at all, since once we get a test
failure, we can deterministically reproduce the failure (when there is no multi-threading)...
{quote}
Ok, this makes sense to me. 

{quote}
It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's
100K whether there are 101K docs or 100M docs, right? Is that your intention?
{quote}
Correct, that is my intention. I prefer not to increase the {{sampleSize}} as the number of
hits grows: bigger samples are slower, more hits already mean more time, and 100K is a nice
sample size anyway. Instead, I adjust the sampleRatio so that the resulting set of documents
is (close to) the {{sampleSize}}.
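
A minimal sketch of that adjustment, assuming the ratio is simply the target sample size divided by the number of hits and capped at 1. The class and method names are hypothetical and are not taken from the attached patch:

```java
// Hypothetical helper, not part of Lucene or the attached patch.
// Illustrates keeping the sample near a fixed size regardless of hit count.
final class SampleRatioSketch {

    /** Returns the fraction of hits to keep so that roughly sampleSize documents remain. */
    static double ratioFor(int sampleSize, long totalHits) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        if (totalHits <= sampleSize) {
            return 1.0; // fewer hits than the target: keep everything, no sampling
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        int sampleSize = 100_000;
        for (long hits : new long[] {50_000L, 101_000L, 100_000_000L}) {
            double ratio = ratioFor(sampleSize, hits);
            // The expected sample stays close to sampleSize once hits exceed it.
            System.out.printf("hits=%d ratio=%.6f expectedSample=%d%n",
                    hits, ratio, Math.round(hits * ratio));
        }
    }
}
```

With this rule the expected sample is about 100K whether there are 101K or 100M hits; only the ratio changes.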

{quote}
I find this assert just redundant – if we always expect 5, we shouldn't assert that we received
5. If we say that very infrequently we might get <5 and we're OK with it .. what's the
point of asserting that at all?
{quote}
Agreed on the <5 case. Asserting may seem redundant, but isn't that the point of unit tests?
The trick is that the assertion should still hold if you change the implementation.

I will add more next week. 

Btw, is there an easy way to retrieve the total facet count for an ordinal? When correcting
facet counts it would be a quick win to limit the estimated number of documents to the actual
number of documents in the index that match that facet. (And maybe use the distribution as
well, to make better estimates.)
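
A minimal sketch of the capping idea, assuming the corrected count is the sampled count scaled up by the sample ratio. How to obtain the index-wide count for an ordinal is the open question above, so it is passed in here as a plain argument; all names are hypothetical:

```java
// Hypothetical helper, not part of Lucene or the attached patch.
final class FacetCountCapSketch {

    /**
     * Scales a sampled facet count up by the sample ratio, then caps the estimate
     * at the number of documents in the index that carry the facet.
     * totalDocsWithFacet is assumed to be known from elsewhere.
     */
    static long correctedCount(int sampledCount, double sampleRatio, long totalDocsWithFacet) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        long estimate = Math.round(sampledCount / sampleRatio);
        // An estimate above the index-wide total cannot be right, so clamp it.
        return Math.min(estimate, totalDocsWithFacet);
    }

    public static void main(String[] args) {
        // 30 sampled hits at ratio 0.001 extrapolate past the 12,000 documents
        // that actually have the facet, so the cap applies.
        System.out.println(correctedCount(30, 0.001, 12_000L));
        // 5 sampled hits extrapolate to fewer than 12,000, so the cap does not apply.
        System.out.println(correctedCount(5, 0.001, 12_000L));
    }
}
```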

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java,
> SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) counting facets
> is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

