lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Tue, 04 Mar 2014 10:58:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919240#comment-13919240 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

Looks good, Rob. I apologize for not mentioning this earlier, but now that XORShift64Random is a
public class, it needs javadocs on all methods and constructors; otherwise documentation linting
will fail. Can you please add them in your next patch?

About XORShift64Random.nextInt() -- modulo is a bit expensive, right? I wonder if there's a
way to generate that faster ... e.g. if SampledDocs did something like
{{random.randomLong() & (binsize-1)}}? I haven't fully thought through how that changes the
distribution of the generated numbers - hopefully it doesn't. Would you mind giving it a try?
And of course {{binsize-1}} can be computed once in the ctor.
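
As a rough illustration of the distribution question (a sketch, not code from the patch): the
mask and the modulo agree exactly when binsize is a power of two, but for any other binsize the
mask can only produce values whose bits are a subset of {{binsize-1}}. The class and method names
below are hypothetical, and the xorshift step uses Marsaglia's (13, 7, 17) shift triple as an
assumption; the actual XORShift64Random in the patch may differ.

```java
class MaskVsModulo {
    // One xorshift64 step with the (13, 7, 17) triple; the state must be non-zero.
    // Assumption: this is only a stand-in for the generator in the patch.
    static long xorShift(long x) {
        x ^= x << 13;
        x ^= x >>> 7;
        x ^= x << 17;
        return x;
    }

    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Index in [0, binSize) via modulo; floorMod keeps the result non-negative.
    static int indexByModulo(long r, int binSize) {
        return (int) Math.floorMod(r, (long) binSize);
    }

    // Index via bit mask, where mask = binSize - 1 is computed once up front.
    // Only equivalent to indexByModulo when binSize is a power of two.
    static int indexByMask(long r, int mask) {
        return (int) (r & mask);
    }

    public static void main(String[] args) {
        long state = 0x9E3779B97F4A7C15L; // arbitrary non-zero seed
        boolean[] seen = new boolean[10];
        for (int i = 0; i < 1000; i++) {
            state = xorShift(state);
            seen[indexByMask(state, 10 - 1)] = true; // binSize 10 is not a power of two
        }
        // With mask 9 (binary 1001), only the values 0, 1, 8 and 9 can ever occur.
        for (int v = 0; v < seen.length; v++) {
            System.out.println(v + " reachable: " + seen[v]);
        }
    }
}
```

So the mask would be a safe replacement only if binsize is rounded to a power of two (which
changes the effective sample ratio); otherwise some positions inside a bin are never picked.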

Also, are you planning to write some unit tests? You can either start with one of the existing
tests or look at the old tests. I think starting from scratch may be easier. The key point is
that in order to test sampling, we need to index many documents to make the samples _count_. So
e.g. we want to make sure that if we specify a 10% sample ratio, then a category's count is ~10%
of the expected count.

In the old tests we had issues w/ false positives - tests that failed on these asserts simply
because sampling is nondeterministic by nature. It would be good if we could craft the test
such that on one hand it does test sampling, but on the other hand doesn't cause unwanted
noise.
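
One way to keep such a test quiet, sketched here without any Lucene classes: fix the random
seed so the run is reproducible, use enough documents that the sampled count is statistically
stable, and assert within a relative tolerance rather than on an exact value. Everything below
(class name, the 30% category frequency, the 5% tolerance) is a hypothetical setup chosen for
illustration, not something taken from the patch or the old tests.

```java
import java.util.Random;

class SamplingRatioCheck {
    // Counts how many docs carrying the category survive sampling at the given ratio.
    static int sampledCount(int numDocs, double ratio, long seed) {
        Random random = new Random(seed); // fixed seed keeps the run reproducible
        int count = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            boolean hasCategory = doc % 10 < 3; // assumption: 30% of docs carry the category
            if (hasCategory && random.nextDouble() < ratio) {
                count++;
            }
        }
        return count;
    }

    // True if actual lies within relTol (e.g. 0.05 for 5%) of expected.
    static boolean withinTolerance(int actual, double expected, double relTol) {
        return Math.abs(actual - expected) <= relTol * expected;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        double ratio = 0.1;
        double expected = numDocs * 0.3 * ratio; // 10% of the 300,000 matching docs
        int actual = sampledCount(numDocs, ratio, 42L);
        System.out.println("sampled=" + actual + " expected~" + expected
            + " ok=" + withinTolerance(actual, expected, 0.05));
    }
}
```

With 300,000 matching documents and a 10% ratio, the standard deviation of the sampled count
is about 164, so a 5% tolerance (1,500) leaves a wide margin even if the seed is later
randomized.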

I do think we can optimize SampledDocs to not use FixedBitSet even in the case of out-of-order
collection (no scores) by keeping an int[] or some other compressed array, especially when
the sample ratio is so small. We can do that later though - we need tests first.
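
A minimal sketch of the int[] idea, under the assumption that sampled doc IDs may arrive out
of order and can be sorted once at the end. The class below is hypothetical and is not the
SampledDocs from the patch. The size estimates assume one bit per document for a bit set and
four bytes per sampled document for the array, so the array is smaller whenever fewer than 1
in 32 documents are sampled.

```java
import java.util.Arrays;

class SampledDocIds {
    private int[] docs = new int[16];
    private int size;

    // Records a sampled doc ID; IDs may arrive in any order.
    void add(int doc) {
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2); // grow geometrically
        }
        docs[size++] = doc;
    }

    // Returns the collected doc IDs in ascending order.
    int[] sortedDocs() {
        int[] out = Arrays.copyOf(docs, size);
        Arrays.sort(out);
        return out;
    }

    // Approximate payload of a bit set covering maxDoc documents (64-bit words).
    static long bitSetBytes(int maxDoc) {
        return ((long) maxDoc + 63) / 64 * 8;
    }

    // Payload of a plain int[] holding numSampled doc IDs.
    static long intArrayBytes(int numSampled) {
        return 4L * numSampled;
    }

    public static void main(String[] args) {
        // 10M documents sampled at 1%: roughly 1.25 MB of bits vs 0.4 MB of ints.
        System.out.println(bitSetBytes(10_000_000) + " vs " + intArrayBytes(100_000));
    }
}
```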

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
