lucene-dev mailing list archives

From "Varun Thacker (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-12820) Auto pick method:dvhash based on thresholds
Date Mon, 01 Oct 2018 20:59:00 GMT
Varun Thacker created SOLR-12820:
------------------------------------

             Summary: Auto pick method:dvhash based on thresholds
                 Key: SOLR-12820
                 URL: https://issues.apache.org/jira/browse/SOLR-12820
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Facet Module
            Reporter: Varun Thacker


I worked with two users last week for whom explicitly setting method:dvhash improved faceting speeds drastically.

The common theme in both use-cases was a single collection hosting data for multiple users. Each query first filters documents down to one user (thereby reducing the number of matching documents drastically) and then performs a complex nested JSON facet.

Both use-cases are a perfect fit for the criterion that [~yonik@apache.org] mentioned on SOLR-9142:
{quote}faceting on a string field with a high cardinality compared to it's domain is less
efficient than it could be.
{quote}
And DVHASH was the perfect optimization for these use-cases.
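For reference, this is roughly what opting into dvhash looks like in a JSON Facet request today, following the filter-to-one-user-then-facet pattern described above. The field and facet names here are purely illustrative, not taken from either user's actual setup:
{code}
{
  "query": "*:*",
  "filter": "user_id:u12345",
  "facet": {
    "top_skus": {
      "type": "terms",
      "field": "sku_s",
      "method": "dvhash",
      "facet": {
        "avg_price": "avg(price_d)"
      }
    }
  }
}{code}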

One of the use-cases uses the facet stream expression, which doesn't expose the method param. We could expose the method param on the facet stream, but I feel the better approach would be to address this TODO within the JSON Facet Module:
{code:java}
      if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
        // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
        //   or if we don't know cardinality but DocSet size is very small
        return new FacetFieldProcessorByHashDV(fcontext, this, sf);{code}
I thought about this a little, and this is the approach I'm currently considering to tackle the problem:
{code:java}
int matchingDocs = fcontext.base.size();
int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
// If matchingDocs is close to totalDocs then we aren't filtering out many documents,
// which means the array approach would probably be better than the dvhash approach.

// Trying to compute the cardinality of matchingDocs would be expensive.
// Also, for totalDocs we don't have a global cardinality available at index time, only a per-segment cardinality.

// So would using the number of matches as an alternative heuristic do the job here?{code}
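To make that concrete, below is a rough sketch of how the heuristic could hook into the existing condition in the facet module. The 5% ratio is a made-up placeholder that would need benchmarking, and whether an unspecified method actually reaches this point as null needs verifying against the current code; treat it as a sketch, not a patch.
{code:java}
int matchingDocs = fcontext.base.size();
int totalDocs = fcontext.searcher.getIndexReader().maxDoc();

// If the filtered domain is only a small fraction of the index, a hash keyed on the
// terms actually present in the domain is likely cheaper than allocating a counts
// array sized to the field's full term cardinality.
// NOTE: 0.05 is a placeholder threshold, not a benchmarked value.
boolean smallDomain = totalDocs > 0 && ((double) matchingDocs / totalDocs) < 0.05;

// Assumption: an unspecified method reaches this check as null (needs verifying).
if (mincount > 0 && prefix == null
    && (ntype != null || method == FacetMethod.DVHASH || (method == null && smallDomain))) {
  return new FacetFieldProcessorByHashDV(fcontext, this, sf);
}{code}
If we go this route, the threshold should probably be configurable (or at least a tunable constant), since the break-even point between the array and hash approaches will depend on the field's term cardinality and docValues type.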
Any thoughts on whether this approach makes sense? It could be that I'm leaning towards it just because both users I worked with last week fell into this category.

 

cc [~dsmiley] [~joel.bernstein]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

