lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5476) Facet sampling
Date Tue, 11 Mar 2014 09:22:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930111#comment-13930111
] 

Shai Erera commented on LUCENE-5476:
------------------------------------

* Javadocs:
** From the class javadocs: _Note: the original set of hits will be available as documents..._
I think we should just write "the original set of hits can be retrieved from getOriginal.."
- I don't want anyone to be confused with the wording "will be available as documents".
** Can you make NOT_CALCULATED static?
** Typo: samplingRato
** randomSeed: I think it should say "if 0..." not "if null".

* getMatchingDocs -- can you move the totalHits calculation to {{getTotalHits()}}? And then
call it only {{if (sampledDocs==null)}}?

* needsSampling -- I know it was suggested to make it protected for inheritance purposes,
but as it looks now, all members are private so I don't see how can one extend the class only
to override this method (e.g. he won't have access to sampleSize even). Maybe we should keep
it private and when someone asks to extend, we know better what needs to be protected? For
instance, I think it's also important that we allow overriding createSampledDocList, but for
now let's keep it simple.

* I think that we need to document somewhere (maybe in class javadocs) that the returned sampled
docs may include empty MatchingDocs instances (i.e. when no docs were "sampled" from a segment).
Just so that we don't surprise anyone with empty instances. If people work w/ MatchingDocs
as they should, by obtaining an iterator, it shouldn't be a problem, but better document it
explicitly.
** Perhaps we should also say something about the returned MatchingDocs.totalHits, which are
the original totalHits and not the sampled set size?

* About carryOver:
** Doesn't it handle the TODO at the beginning of createSample?
** Why does it need to always include the first document of the first segment? Rather you
could initialize it to -1 and if {{carryOver == -1}} set it to a random index within that
bin? Seems more "random" to me.

* amortizedFacetCounts:
** It's a matter of style so I don't mind if you keep this like you wrote, but I would write
it as {{if (!needsSampling() || res == null)}} and then the rest of the method isn't indented.
Your call.
** I think it's better to allocate {{childPath}} so that the first element is already the
dimension. See what FacetsConfig.pathToString currently does. Currently it results in re-allocating
the String[] for every label. Then you can call the pathToString variant which takes the array
and the length.
*** Separately, would be good if FacetsConfig had an appendElement(Appendable, int idx) to
append a specific element to the appendable. Then you could use a StringBuilder once, and
skip the encoding done for the entire path except the last element.
** Perhaps we should cap this {{(int) (res.value.doubleValue() / this.samplingRate)}} by e.g.
the number of non-deleted documents?

* About test:
** This comment is wrong: _//there will be 5000 hits._
** Why do you initialize your own Random? It's impossible to debug the test like that. You
should use {{random()}} instead.
** This comment is wrong? _//because a randomindexer is used, the results will vary each time._
-- it's not because the random indexer, but because of random sampling no?
** Would you mind formatting the test code? I see empty lines, curly braces that are sometimes
open in the next line etc. I can do it too before I commit it.

* I see that createSample still has the scores array bug -- it returns the original scores
array, irrespective of the sampled docs. Before you fix it, can you please add a testcase
which covers scores and fails?

I think we're very close!

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java,
SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) counting facets
is rather expensive, as all the hits are collected and processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message