lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9142) Improve JSON nested facets effeciency
Date Mon, 23 May 2016 14:39:12 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296448#comment-15296448
] 

Yonik Seeley commented on SOLR-9142:
------------------------------------

Yes, the issue here is that faceting on a string field whose cardinality is high compared to
its domain is less efficient than it could be.
For those cases, the direct mapping slot == ord is suboptimal, and we should instead go with a
hash-based approach (something like what we do with numeric faceting),
perhaps by creating an accumulator implementation that does the mapping before calling another
accumulator.
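
As a rough illustration of the hash-based idea, here is a minimal sketch of a counter keyed by
term ordinal whose storage is sized to the number of distinct ordinals expected in the domain
rather than to the field's full cardinality. The class name OrdHashCounter and its methods are
hypothetical and are not part of Solr's accumulator API; it assumes ordinals are non-negative
and uses simple linear probing.

```java
import java.util.Arrays;

// Hypothetical sketch, not Solr's SlotAcc API: counts by term ordinal using
// an open-addressing hash sized to the domain instead of an array of size
// "field cardinality" indexed directly by ord.
final class OrdHashCounter {
    private final int[] ords;    // hash keys: term ordinals, -1 marks an empty entry
    private final int[] counts;  // counts[i] belongs to ords[i]
    private final int mask;
    private final int maxDistinct;
    private int size;

    OrdHashCounter(int maxDistinct) {
        if (maxDistinct < 1 || maxDistinct > (1 << 29)) {
            throw new IllegalArgumentException("maxDistinct out of range");
        }
        this.maxDistinct = maxDistinct;
        int cap = 2;
        while (cap < 2L * maxDistinct) cap <<= 1; // power of two, load factor <= 0.5
        ords = new int[cap];
        Arrays.fill(ords, -1);
        counts = new int[cap];
        mask = cap - 1;
    }

    // Returns the entry holding ord, or the empty entry where it would go.
    private int find(int ord) {
        int h = ord * 0x9E3779B9;           // cheap multiplicative mixing
        int i = (h ^ (h >>> 16)) & mask;
        while (ords[i] != -1 && ords[i] != ord) i = (i + 1) & mask;
        return i;
    }

    void collect(int ord) {
        int i = find(ord);
        if (ords[i] == -1) {
            if (size == maxDistinct) {
                throw new IllegalStateException("more distinct ords than expected");
            }
            ords[i] = ord;
            size++;
        }
        counts[i]++;
    }

    int count(int ord) {
        int i = find(ord);
        return ords[i] == ord ? counts[i] : 0;
    }

    int capacity() {
        return ords.length;
    }
}
```

With a domain of at most 3 distinct ordinals, this allocates 8 entries even if the ordinals
themselves range up to 2M, whereas a direct slot == ord array would need 2M entries.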

Even the hashing approach we use with numeric faceting could perhaps be improved on. Today
we use the slot in the hash table as the slot in the accumulator (think of each accumulator
as a bunch of parallel hash tables), but we could alternatively hash to a dense table (i.e.
the hash would hold the slot number).  This really only applies to accumulators needed in
phase 1 (sorting), but could make any that contain a lot of state per slot more efficient.
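
The dense-table variant might look something like the following sketch, in which the hash
stores a slot number and slots are handed out consecutively, so accumulator arrays only need
one entry per distinct value actually seen. DenseSlotMap and slotFor are hypothetical names
chosen for illustration, and the long key stands in for a numeric field value; none of this is
taken from the Solr code base.

```java
import java.util.Arrays;

// Hypothetical sketch: the hash table holds slot numbers, so accumulators
// can be dense arrays indexed by slot rather than parallel hash tables.
final class DenseSlotMap {
    private final long[] keys;   // hashed values
    private final int[] slots;   // dense slot number per entry, -1 marks empty
    private final int mask;
    private final int maxSlots;
    private int size;            // number of slots assigned so far

    DenseSlotMap(int maxSlots) {
        if (maxSlots < 1 || maxSlots > (1 << 29)) {
            throw new IllegalArgumentException("maxSlots out of range");
        }
        this.maxSlots = maxSlots;
        int cap = 2;
        while (cap < 2L * maxSlots) cap <<= 1; // power of two, load factor <= 0.5
        keys = new long[cap];
        slots = new int[cap];
        Arrays.fill(slots, -1);
        mask = cap - 1;
    }

    // Returns the dense slot for value, assigning the next free slot on first sight.
    int slotFor(long value) {
        long h = value * 0x9E3779B97F4A7C15L;   // cheap multiplicative mixing
        int i = (int) (h ^ (h >>> 32)) & mask;
        while (slots[i] != -1) {
            if (keys[i] == value) return slots[i];
            i = (i + 1) & mask;                 // linear probing
        }
        if (size == maxSlots) {
            throw new IllegalStateException("more distinct values than slots");
        }
        keys[i] = value;
        slots[i] = size;
        return size++;
    }

    int size() {
        return size;
    }
}
```

An accumulator with a lot of state per slot would then keep arrays of length maxSlots (for
example an int[] of counts and a double[] of sums) and index them with the result of
slotFor(value), instead of sizing every array to the hash table's capacity.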


> Improve JSON nested facets effeciency
> -------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Varun Thacker
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> The nested facets use two fields, {{sub_facet_unique_s}} (string) and
> {{sub_facet_unique_td}} (double), each with a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The nested query
> for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 		"top_facet_s": {
> 			"type": "terms",
> 			"limit": -1,
> 			"field": "top_facet_s",
> 			"mincount": 1,
> 			"excludeTags": "ANY",
> 			"facet": {
> 				"sub_facet_unique_s": {
> 					"type": "terms",
> 					"limit": 1,
> 					"field": "sub_facet_unique_s",
> 					"mincount": 1
> 				}
> 			}
> 		}
> 	}
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> 	{
> 		"top_facet_s": {
> 			"type": "terms",
> 			"limit": -1,
> 			"field": "top_facet_s",
> 			"mincount": 1,
> 			"excludeTags": "ANY",
> 			"facet": {
> 				"sub_facet_unique_s": {
> 					"type": "terms",
> 					"limit": 1,
> 					"field": "sub_facet_unique_td",
> 					"mincount": 1
> 				}
> 			}
> 		}
> 	}
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is that slow
> compared to a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets for each
> of its buckets. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call {{createCollectAcc}}
> with nDocs=0 and numSlots=2M. This then initializes an array of 2M. So we create a 2M array
> 1000 times for this one query, which from what I understand makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a CountSlotAcc which
> doesn't allocate a huge array. In this query it calls {{createCollectAcc}} with numDocs=2k
> and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and we use the
> array position as the ordinal and the value as the count. If we could improve on this, would
> it speed things up significantly? For sub-facets we know the cardinality can be at most
> the top-level bucket count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

