lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Mon, 03 Mar 2014 15:37:22 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918164#comment-13918164 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

+1 for removing SamplingParams.

I'm OK if the original collected hits are lost; let's just make sure we document this on the
collector. Perhaps in the future (separate issue) we can add another collector (or enhance
this one) to allow you to get both the original docs and the sampled docs.

I wonder what the performance implications are of sampling during collection versus post-collection.
In the past, Mike and I saw that not interfering with collection improves performance, though
what we measured was accessing the ordinals-DV during collection. I'm just wondering whether
in-collection sampling performs better than post-collection. Especially if sampleRatio is low,
sampling in-collection means we set far fewer bits on the FBS, instead of setting them all just
to clear most of them afterwards. On the other hand, we add calls to random.nextInt/Double, so
that's a tradeoff which would be good to measure. We don't even need random.nextDouble(); we can
do what you/Mike suggested above: work in "bins", and for each bin draw a random index (the
binIndex) and discard all hits given to addDoc except the one at that index.
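
To make the "bins" idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name {{BinSampler}}, the method {{accept()}} and the way the bin size is derived from {{sampleRatio}} are all assumptions made for illustration. A collector could call {{accept()}} once per hit and only record the doc when it returns true.

```java
import java.util.Random;

/**
 * Hypothetical helper (not part of Lucene): keeps exactly one hit out of
 * every bin of binSize consecutive hits, at a randomly drawn position.
 */
final class BinSampler {
    private final int binSize;
    private final Random random;
    private int posInBin = 0; // position of the next hit inside the current bin
    private int keepIndex;    // the single position kept in the current bin

    BinSampler(double sampleRatio, long seed) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        // Assumption: a ratio of 0.1 maps to bins of 10 hits, 0.01 to bins of 100, etc.
        this.binSize = Math.max(1, (int) Math.round(1.0 / sampleRatio));
        this.random = new Random(seed);
        this.keepIndex = random.nextInt(binSize);
    }

    /** Call once per collected hit; returns true if this hit should be kept. */
    boolean accept() {
        boolean keep = (posInBin == keepIndex);
        posInBin++;
        if (posInBin == binSize) {
            // Bin is full: start a new bin and draw a new index to keep.
            posInBin = 0;
            keepIndex = random.nextInt(binSize);
        }
        return keep;
    }
}
```

With this scheme there is one {{random.nextInt}} call per bin rather than one random draw per hit, which is the cost saving the comment alludes to. Whether it is actually faster would still need measuring.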

I also think that if we keep the original FBS (whether we clone-and-clear, create-new-sampled-one
or whatever), we should iterate over the matching docs and not over all the bits. I don't see the
logic of iterating over all the bits at all, unless bitset.cardinality() is very big (like maybe
75% of the bitset size)? Of course, if we move to sampling in-collection, that's not an issue at
all.
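
As a rough illustration of iterating only over the matching docs, the sketch below uses {{java.util.BitSet}} as a stand-in for the FBS (assumed here to be Lucene's FixedBitSet) and builds a new sampled bitset post-collection. The class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical post-collection sampler; java.util.BitSet stands in for the FBS. */
final class PostCollectionSampling {

    /** Returns a new bitset holding a random subset of the set bits in hits. */
    static BitSet sample(BitSet hits, double sampleRatio, long seed) {
        Random random = new Random(seed);
        BitSet sampled = new BitSet(hits.length());
        // Jump from one set bit to the next instead of visiting every bit index.
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            if (random.nextDouble() < sampleRatio) {
                sampled.set(doc);
            }
            if (doc == Integer.MAX_VALUE) {
                break; // avoid overflow of doc + 1
            }
        }
        return sampled;
    }
}
```

The loop's cost is proportional to the number of hits rather than to the size of the bitset, which matters when only a small fraction of the docs matched.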

Rob, I want to make sure we are all on the same page as to what we want to achieve in this
issue (helps scope it):

* Add SamplingFacetsCollector which takes {{sampleRatio}} and optional {{seed}} and random-samples
the matching docs so that facets are counted on fewer hits
* We should note that as a result the weights computed for facets are approximations only,
and the application needs to correct them if it wants (e.g. by multiplying by the inverse of
{{sampleRatio}}, or by computing the exact count, etc.). At any rate, this is something that we
don't plan to tackle in this issue, right?
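
For reference, the correction by the inverse of {{sampleRatio}} mentioned above amounts to something like the following sketch. The class and method names are made up for illustration, and the result is only an estimate of the true count.

```java
/** Hypothetical application-side helper for scaling sampled facet counts. */
final class SampledCounts {

    /** Estimates the full count by multiplying the sampled count by 1 / sampleRatio. */
    static long extrapolate(int sampledCount, double sampleRatio) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        return Math.round(sampledCount / sampleRatio);
    }
}
```

For example, a facet value counted 10 times in a sample taken with a ratio of 0.25 would be reported as roughly 40.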

Maybe we should call it RandomSamplingFacetsCollector to denote that it's a random sample (vs.
e.g. the approaches mentioned by Gilad above)? Then it is clear what the parameter {{seed}}
means.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


