lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
Date Mon, 09 Jul 2018 17:33:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537306#comment-16537306
] 

Hoss Man commented on SOLR-12343:
---------------------------------

Ok ... fresh eyes and i see the problem.

When {{final int overreq = 0}} we don't add any "filler" docs, which means that when the nested
facet test happens, shardC0 and shardC1 disagree about the "top term" for the parent facet
on the {{all_ss}} field -- shardC0 only knows about {{z_all}} while shardC1 has a tie between
{{z_all}} and {{some}}, and {{some}} wins the tie due to index order. So when that parent
facet uses {{overrequest:0}}, the initial merge logic doesn't have any contributions from shardC1
for the chosen {{all_ss:z_all}} bucket, and it only knows to ask to refine the top 3 child
buckets it does know about (from shardC0): "A,B,C".  If the parent facet uses any overrequest
larger than 0, then it would get the {{all_ss:z_all}} bucket from shardC1 as well, and have
some child buckets to consider to know that C is a bad candidate, and it should be refining
X instead.

On the flip side, when {{final int overreq = 1}} (or anything higher) the addition of even
a few filler docs is enough to skew the {{all_ss}} term stats on shardC1, such that it *also*
thinks {{z_all}} is the top term, so regardless of the amount of overrequest on the top facet,
the phase #1 merge has buckets from both shards for the child facet to consider.
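
To make the two cases above concrete, here is a rough Python sketch of the phase #1 merge as i
understand it from this discussion. It is a toy model, not Solr's code: the function names and
every count are made up for illustration, and only the shard names, the {{all_ss}} terms and
the child terms A/B/C/X come from the test being discussed. Ties are broken by term order as a
stand-in for index order.

```python
# Toy model of the phase #1 merge for a nested terms facet. All counts are
# hypothetical; only the shard and term names come from the discussion above.

def top_terms(counts, n):
    """Top n terms by count desc; ties broken by term order (stand-in for index order)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])

def child_refinement_candidates(shards, parent_term, parent_overrequest,
                                parent_limit=1, child_limit=3):
    """Merge child buckets from every shard that returned parent_term in phase #1."""
    merged = {}
    for shard in shards:
        returned = top_terms(shard["parent"], parent_limit + parent_overrequest)
        if parent_term not in returned:
            continue  # this shard contributes nothing for the parent bucket
        children = top_terms(shard["children"][parent_term], child_limit)
        for term, count in children.items():
            merged[term] = merged.get(term, 0) + count
    return list(top_terms(merged, child_limit))

# shardC0 only knows z_all; shardC1 has a tie that "some" wins.
shard_c0 = {"parent": {"z_all": 20},
            "children": {"z_all": {"A": 9, "B": 8, "C": 7, "X": 1}}}
shard_c1 = {"parent": {"some": 10, "z_all": 10},
            "children": {"z_all": {"X": 20, "A": 2, "B": 2}}}
# Hypothetical effect of filler docs: z_all becomes the clear top term on shardC1.
shard_c1_filler = {"parent": {"some": 10, "z_all": 12},
                   "children": shard_c1["children"]}

no_overreq = child_refinement_candidates([shard_c0, shard_c1], "z_all", 0)
with_overreq = child_refinement_candidates([shard_c0, shard_c1], "z_all", 1)
with_filler = child_refinement_candidates([shard_c0, shard_c1_filler], "z_all", 0)

print(no_overreq)    # ['A', 'B', 'C']
print(with_overreq)  # ['X', 'A', 'B']
print(with_filler)   # ['X', 'A', 'B']
```

With no parent overrequest and no filler docs, shardC1 never reports {{z_all}}, so only
shardC0's child buckets are candidates. Either a parent overrequest of 1 or the skewed filler
counts is enough to bring shardC1's child buckets into the merge.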

----

I remember when i was writing this test and included the {{some}} terms, the entire point
was to stress the case where the 2 shards disagree about the "top" term from the parent
facet -- but apparently when adding the filler docs/terms randomization i broke that so that
it's not always true, it only happens when there are no filler docs.  But it also seems like
an unfair test, because when they do disagree, there's no reason for the merge logic to think
X is a worthwhile term to refine. What matters is that in this case, C is accurately refined.

I'm working up a test fix...


> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch,
SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause
_refined_ buckets to be "bumped out" of the topN based on the refined counts/stats depending
on the sort - causing _unrefined_ buckets originally discounted in phase#2 to bubble up into
the topN and be returned to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count
asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low shard1
counts
>  ** but *not* returned at all by shard2, because these terms both have very high shard2
counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete count/stat/sub-facet
data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a significantly
higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the client have counts/stats
that are the cumulation of all shards, but termY only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. Additional
overrequest just increases the number of "extra" terms needed in the index with "better" sort
values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during refinement can cause
a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}}
, etc...
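
The termX/termY scenario in the quoted description can be sketched as a small Python
simulation. This is only a toy model of "refine the topN, then re-sort" under assumed numbers:
the function names, the extra terms termP/termQ and all counts are hypothetical, and the code
does not reproduce Solr's actual implementation.

```python
# Toy simulation of simple refinement with sort "count asc". All counts and the
# extra terms termP/termQ are hypothetical, chosen only to trigger the problem.

def phase1(shard_counts, limit, overrequest):
    """Each shard returns its (limit + overrequest) lowest-count terms."""
    ranked = sorted(shard_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ranked[: limit + overrequest])

def simple_refine_count_asc(shards, limit, overrequest):
    responses = [phase1(s, limit, overrequest) for s in shards]
    # Phase #1 merge: sum whatever each shard happened to return.
    known = {}
    for response in responses:
        for term, count in response.items():
            known[term] = known.get(term, 0) + count
    # Phase #2: refine only the current topN against shards that did not report them.
    top_n = sorted(known, key=lambda t: (known[t], t))[:limit]
    for term in top_n:
        for shard, response in zip(shards, responses):
            if term not in response:
                known[term] += shard.get(term, 0)
    # Re-sort on refined counts: unrefined buckets can now bubble up.
    resorted = sorted(known.items(), key=lambda kv: (kv[1], kv[0]))
    return resorted[:limit]

shard1 = {"termX": 1, "termY": 2, "termP": 60, "termQ": 70}
shard2 = {"termP": 50, "termQ": 55, "termX": 100, "termY": 101}
shards = [shard1, shard2]

result = simple_refine_count_asc(shards, limit=1, overrequest=1)
true_totals = {t: sum(s.get(t, 0) for s in shards) for t in shard1}

print(result)                # [('termY', 2)]
print(true_totals["termY"])  # 103
print(true_totals["termX"])  # 101
```

In this model termX is refined to its full count of 101, which pushes it below the unrefined
termY, so termY is returned with only its shard1 count of 2 even though its collection-wide
count is 103.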



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

