lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-5476) Facet sampling
Date Thu, 06 Mar 2014 18:56:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922884#comment-13922884 ]

Shai Erera edited comment on LUCENE-5476 at 3/6/14 6:56 PM:
------------------------------------------------------------

bq. but any facet accumulation which would rely on document scores would be hit by the second
as the scores

That's a great point, Gilad. We need a test which covers that with the random sampling collector.
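
Something like this, maybe (totally hypothetical -- I'm assuming the patch's collector mirrors {{FacetsCollector}}'s keepScores behavior, and the constructor here is made up):

{code:java}
// Hypothetical sketch: verify that score-based accumulation survives sampling.
// SamplingFacetsCollector's constructor is invented; MatchingDocs.scores is
// the regular FacetsCollector structure.
FacetsCollector fc = new SamplingFacetsCollector(true /* keepScores */);
searcher.search(new MatchAllDocsQuery(), fc);
for (FacetsCollector.MatchingDocs md : fc.getMatchingDocs()) {
  assertNotNull("sampled docs must keep their scores", md.scores);
}
{code}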

bq. Is there a reason to add more randomness to one test?

It depends. I have a problem with numDocs=10,000 and the percentage being 10% .. it creates
too-perfect numbers, if you know what I mean. I prefer a random number of documents, to add
some spice to the test. And since we're testing a random sampler, I don't think it makes sense
to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we
should stress the randomness in it. Given our test framework, randomness is not a big deal at
all: once we get a test failure, we can deterministically reproduce it (when there is no
multi-threading). So I say YES, in this test I think we should have randomness.
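
To illustrate, a minimal sketch of the kind of test setup I mean ({{atLeast()}} and {{random()}} are the standard {{LuceneTestCase}} hooks; the class name is made up):

{code:java}
import org.apache.lucene.util.LuceneTestCase;

// Sketch only: everything random here derives from the master test seed, so a
// failing run is replayed exactly by re-running with the printed seed.
public class TestSamplingRandomness extends LuceneTestCase {
  public void testRandomizedSampling() throws Exception {
    int numDocs = atLeast(8000);            // seed-dependent, not a fixed 10,000
    long samplerSeed = random().nextLong(); // instead of a hardcoded 0xdeadbeef
    // ... index numDocs docs, sample with samplerSeed, verify the fixed counts ...
  }
}
{code}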

But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores,
I don't think you should add randomness -- it's ok to test it once.

Also, in terms of test coverage, there are other cases which I think would be good to test:

* Docs + Scores (discussed above)
* Multi-segment indexes (ensuring we work well there)
* Different number of hits per-segment (to make sure our sampling on tiny segments works well
too)
* ...

I wouldn't, for example, use RandomIndexWriter, because we're only testing search (so it just
adds noise to this test). If we want many segments, we should commit/nrt-open every few docs,
disable the merge policy, etc. These can be separate, real "unit" tests.
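
For the many-segments case, I mean something like this (a sketch assuming the 4.x test APIs; {{newFacetDoc}} is a hypothetical doc builder):

{code:java}
// Sketch: a deterministic multi-segment index, without RandomIndexWriter.
public void testMultiSegmentSampling() throws Exception {
  Directory dir = newDirectory();
  IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random()));
  iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // no merges -- segments stay as flushed
  IndexWriter writer = new IndexWriter(dir, iwc);
  int numDocs = atLeast(8000);
  for (int i = 0; i < numDocs; i++) {
    writer.addDocument(newFacetDoc(i)); // hypothetical helper
    if (i % 100 == 99) {
      writer.commit();                  // force a new segment every 100 docs
    }
  }
  // ... nrt-open a reader, search + sample, assert per-segment behavior ...
  writer.close();
  dir.close();
}
{code}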

bq. Sorry, I don't get what you mean by this.

I meant that if you set {{numDocs = atLeast(8000)}}, then the 10% sampler should not be hardcoded
to 1,000, but use {{numDocs * 0.1}}.
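
I.e. (sketch):

{code:java}
int numDocs = atLeast(8000);
int expectedSampleSize = (int) (numDocs * 0.1); // track numDocs, don't hardcode 1,000
{code}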

bq. the original totalHits .. is used

I think that's OK. In fact, if we don't record that, it would be hard to fix the counts, no?
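
That is, if I read the fixing logic right, the original totalHits is what lets us scale the sampled counts back up (a sketch, assuming uniform sampling; the names are mine):

{code:java}
// Sketch of count fixing: scale each sampled count by the inverse sampling ratio.
double samplingRatio = (double) sampleSize / totalHits; // e.g. 100K / 1M = 0.1
int fixedCount = (int) (sampledCount / samplingRatio);  // ~ sampledCount * 10
{code}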

{quote}
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits.
There is a REAL small chance that one of the five values will be entirely missed when sampling.
But is that 0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so
that is probably not going to happen
{quote}

Ahh thanks, I missed that. I agree it's very improbable that one of the values goes missing,
but if we can avoid that at all, it's better. And it's not just one of the values -- we could
be missing even 2, right -- it really depends on the randomness. I also find this assert
redundant: if we cannot always expect 5, we shouldn't assert that we received 5, and if we say
that very infrequently we might get <5 and we're OK with that .. then what's the point of
asserting it at all?
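
BTW, the quoted estimate checks out -- a quick sanity check in plain Java:

{code:java}
// Per the quote: each of the 2000 hits for a value is skipped with probability
// 0.8, and any of the 5 values could be the one that goes missing entirely.
double pMissOneValue = Math.pow(0.8, 2000); // ~ 1.5e-194, still fits in a double
double pMissAnyValue = 5 * pMissOneValue;   // ~ 7.6e-194, i.e. ~ 10^-193
System.out.println(pMissAnyValue);
{code}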

bq. I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will
reduce the number of hits to the sampleSize, if the number of hits is greater.

It looks like it hasn't changed, besides the rename? So if I set sampleSize=100K, it's 100K
whether there are 101K docs or 100M docs, right? Is that your intention?
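
Just to make sure we mean the same thing, a sketch of the behavior as I understand the description (the names are mine):

{code:java}
// Cap the number of sampled hits at sampleSize, regardless of the total:
double samplingRatio = totalHits > sampleSize
    ? (double) sampleSize / totalHits // 100K hits out of 100M -> ratio 0.001
    : 1.0;                            // fewer hits than sampleSize: keep them all
{code}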


> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339, facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents), counting facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this cost and thus provided a nice speedup. Could it be brought back?


