lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Wed, 05 Mar 2014 11:52:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920771#comment-13920771 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

bq. Actually, in my application, I always do a count before any other search/facetting

Hmm, what do you mean? How do you count the number of hits before you execute the search?

The reason the previous sampling solution did not sample per-segment is that, to get a good
sample size and a representative set, you first need to know how many documents the query
matches; only then can you sample well, taking min/maxSampleSize into account. Asking the app
to define these boundaries per-segment is odd because the app may not know how many segments
an index has, or how the segment sizes are distributed. For instance, if an index contains 10
segments and the app is willing to fully evaluate 200K docs in order to get a good sampled set,
it would be wrong to specify that each segment needs to sample 20K docs, because the last two
segments may be tiny, so in practice you'll end up w/ a sampled set of ~160K docs. On the other
hand, if the search is evaluated entirely, such that you know the List<MatchingDocs> before
sampling, you can make a global decision about which documents to sample, given the
min/maxSampleSize constraints.
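
To make the arithmetic above concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class SampleBudget, its two methods, and the segment hit counts are made up for illustration; proportional allocation is only one possible way to make the "global decision", not something the old sampling code is known to do.

```java
// Hypothetical helper, not part of Lucene: compares a fixed per-segment quota
// with a budget that is computed only after the total hit count is known.
final class SampleBudget {

    /** Each segment samples at most perSegmentQuota docs, ignoring the other segments. */
    static long fixedQuotaTotal(int[] hitsPerSegment, int perSegmentQuota) {
        long total = 0;
        for (int hits : hitsPerSegment) {
            total += Math.min(hits, perSegmentQuota);
        }
        return total;
    }

    /** Global decision: give each segment a share of the budget proportional to its hits. */
    static long[] proportionalAllocation(int[] hitsPerSegment, long targetSampleSize) {
        long totalHits = 0;
        for (int hits : hitsPerSegment) {
            totalHits += hits;
        }
        long[] allocation = new long[hitsPerSegment.length];
        for (int i = 0; i < hitsPerSegment.length; i++) {
            if (totalHits == 0 || totalHits <= targetSampleSize) {
                // Fewer hits than the budget: keep everything, no sampling needed.
                allocation[i] = hitsPerSegment[i];
            } else {
                // Integer division rounds down, so the sum can fall slightly short of the target.
                allocation[i] = (long) hitsPerSegment[i] * targetSampleSize / totalHits;
            }
        }
        return allocation;
    }
}
```

With eight segments of 1,000,000 hits each and two segments of 100 hits each, fixedQuotaTotal(hits, 20_000) yields 160,200 docs, while proportionalAllocation(hits, 200_000) sums to just under 200,000 (the shortfall comes from rounding down).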

At the beginning of this issue I thought that sampling could work like that:

{code}
FacetsCollector fc = new FacetsCollector(...);
searcher.search(q, fc);
Sampler sampler = new RandomSampler(...);
List<MatchingDocs> sampledDocs = sampler.sample(fc.getMatchingDocs());
facets.count(sampledDocs);
{code}

But the Facets impls all take a FacetsCollector, so perhaps what we need is to implement RandomSamplingFacetsCollector
and only override getMatchingDocs() to return the sampled set (and of course cache it). If
we later want to return the original set, it's trivial to cache it aside (I don't think
we should do it in this issue).
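
A rough sketch of that idea, written against stand-in types so it compiles without Lucene. SegmentHits, HitsCollector and RandomSamplingHitsCollector are hypothetical names, not the Lucene classes, and the Bernoulli-style sampling with a single maxSampleSize target is an assumption; it only approximates the target and ignores minSampleSize.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Stand-in types for illustration only; these are NOT the Lucene classes.
final class SamplingSketch {

    /** Stand-in for one segment's matches: a bitset of doc IDs plus the hit count. */
    static final class SegmentHits {
        final BitSet bits;
        final int totalHits;

        SegmentHits(BitSet bits) {
            this.bits = bits;
            this.totalHits = bits.cardinality();
        }
    }

    /** Stand-in for a collector that exposes every hit, segment by segment. */
    static class HitsCollector {
        private final List<SegmentHits> segments = new ArrayList<>();

        void addSegment(BitSet bits) {
            segments.add(new SegmentHits(bits));
        }

        List<SegmentHits> getMatchingDocs() {
            return segments;
        }
    }

    /** Overrides only getMatchingDocs() so that callers see a cached sample. */
    static final class RandomSamplingHitsCollector extends HitsCollector {
        private final int maxSampleSize;
        private final long seed;
        private List<SegmentHits> sampled; // computed on first access; assumes collection is finished

        RandomSamplingHitsCollector(int maxSampleSize, long seed) {
            this.maxSampleSize = maxSampleSize;
            this.seed = seed;
        }

        @Override
        List<SegmentHits> getMatchingDocs() {
            if (sampled == null) {
                sampled = sample(super.getMatchingDocs());
            }
            return sampled;
        }

        private List<SegmentHits> sample(List<SegmentHits> all) {
            long total = 0;
            for (SegmentHits s : all) {
                total += s.totalHits;
            }
            if (total <= maxSampleSize) {
                return all; // small result set: nothing to sample
            }
            // The global hit count is known here, so one rate applies to all segments.
            double rate = (double) maxSampleSize / total;
            Random random = new Random(seed);
            List<SegmentHits> out = new ArrayList<>();
            for (SegmentHits s : all) {
                BitSet kept = new BitSet(s.bits.length());
                for (int doc = s.bits.nextSetBit(0); doc >= 0; doc = s.bits.nextSetBit(doc + 1)) {
                    if (random.nextDouble() < rate) {
                        kept.set(doc);
                    }
                }
                out.add(new SegmentHits(kept));
            }
            return out;
        }
    }
}
```

Because only getMatchingDocs() changes, any consumer that reads hits through the collector would count over the sample without further modification, which is the appeal of the subclass approach.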

I realize it means we allocate bitsets unnecessarily, but that is a correct way to create a meaningful
sample. Unless we can do it per-segment, but I think that's tricky since we never know a priori how many
hits a segment will match. Perhaps we should focus here on getting a correct and meaningful
sample, and improve performance only if it becomes a bottleneck? After all, setting a bit
in a bitset is far faster than scoring the document.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?




