lucene-dev mailing list archives

From "Vasiliy Bout (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-8496) Facet search count numbers are falsified by older document versions
Date Fri, 15 Jan 2016 16:29:40 GMT

    [ https://issues.apache.org/jira/browse/SOLR-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102003#comment-15102003 ]

Vasiliy Bout edited comment on SOLR-8496 at 1/15/16 4:28 PM:
-------------------------------------------------------------

I put together a small example that reproduces this problem on a completely new core
with a very simple schema and 20 documents.

First of all, I created a new core with the following schema.xml:
{noformat}
<?xml version="1.0" ?>
<schema name="basic" version="1.1">
    <types>
        <fieldType name="string" class="solr.StrField" omitNorms="true" indexed="true" stored="true"/>
        <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" indexed="true" stored="true"/>
    </types>
    <fields>
        <field name="id" type="string" required="true"/>
        <field name="foo_s" type="string"/>
        <field name="bar_s" type="string" docValues="true"/>
        <field name="foo_i" type="int"/>
        <field name="bar_i" type="int" docValues="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <solrQueryParser defaultOperator="OR"/>
</schema>
{noformat}

After that, I generated a set of documents to fill the core with. I launched the {{python}}
interpreter in a terminal and typed the following one-liner:
{noformat}
[ {"id":i,"foo_i":i,"bar_i":i,"foo_s":i,"bar_s":i} for i in range(1, 21) ]
{noformat}
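
The interpreter prints Python dict literals with single quotes, which is not strict JSON. As a minimal sketch, assuming only the standard {{json}} module (the variable names {{docs}} and {{payload}} are illustrative and not part of the original steps), the same 20 documents can be serialized as strict JSON before pasting them anywhere:

```python
import json

# Same comprehension as the one-liner: 20 documents, every field set to the index.
docs = [{"id": i, "foo_i": i, "bar_i": i, "foo_s": i, "bar_s": i} for i in range(1, 21)]

# json.dumps emits double-quoted keys, i.e. strict JSON rather than a Python repr.
payload = json.dumps(docs, indent=4)
print(payload)
```

Whether Solr's JSON loader also accepts the single-quoted form depends on how lenient its parser is, so the strict form is the safer choice.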

It gave me a set of 20 documents. Here is the same set, reformatted slightly for
readability:
{noformat}
[
    {'bar_s': 1, 'foo_i': 1, 'bar_i': 1, 'foo_s': 1, 'id': 1},
    {'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2},
    {'bar_s': 3, 'foo_i': 3, 'bar_i': 3, 'foo_s': 3, 'id': 3},
    {'bar_s': 4, 'foo_i': 4, 'bar_i': 4, 'foo_s': 4, 'id': 4},
    {'bar_s': 5, 'foo_i': 5, 'bar_i': 5, 'foo_s': 5, 'id': 5},
    {'bar_s': 6, 'foo_i': 6, 'bar_i': 6, 'foo_s': 6, 'id': 6},
    {'bar_s': 7, 'foo_i': 7, 'bar_i': 7, 'foo_s': 7, 'id': 7},
    {'bar_s': 8, 'foo_i': 8, 'bar_i': 8, 'foo_s': 8, 'id': 8},
    {'bar_s': 9, 'foo_i': 9, 'bar_i': 9, 'foo_s': 9, 'id': 9},
    {'bar_s': 10, 'foo_i': 10, 'bar_i': 10, 'foo_s': 10, 'id': 10},
    {'bar_s': 11, 'foo_i': 11, 'bar_i': 11, 'foo_s': 11, 'id': 11},
    {'bar_s': 12, 'foo_i': 12, 'bar_i': 12, 'foo_s': 12, 'id': 12},
    {'bar_s': 13, 'foo_i': 13, 'bar_i': 13, 'foo_s': 13, 'id': 13},
    {'bar_s': 14, 'foo_i': 14, 'bar_i': 14, 'foo_s': 14, 'id': 14},
    {'bar_s': 15, 'foo_i': 15, 'bar_i': 15, 'foo_s': 15, 'id': 15},
    {'bar_s': 16, 'foo_i': 16, 'bar_i': 16, 'foo_s': 16, 'id': 16},
    {'bar_s': 17, 'foo_i': 17, 'bar_i': 17, 'foo_s': 17, 'id': 17},
    {'bar_s': 18, 'foo_i': 18, 'bar_i': 18, 'foo_s': 18, 'id': 18},
    {'bar_s': 19, 'foo_i': 19, 'bar_i': 19, 'foo_s': 19, 'id': 19},
    {'bar_s': 20, 'foo_i': 20, 'bar_i': 20, 'foo_s': 20, 'id': 20}
]
{noformat}

After that I opened the Solr Admin page in my browser, went to the "Documents" tab of my core
and filled the core with the set of documents above. I selected the following parameters:
* Request-Handler (qt): {{/update/json}};
* Document Type: {{Solr Command (raw XML or JSON)}};
* Documents: the JSON generated above in the python interpreter.
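
For anyone who prefers a script to the Admin UI, the same upload can be sketched with the standard library. This is only an illustration under stated assumptions: the host and port ({{localhost:8983}}), the core name ({{test_core}}) and the {{commit=true}} parameter are my choices for the example, not values from the steps above. The request is built but deliberately not sent:

```python
import json
import urllib.request

# Assumptions: Solr on localhost:8983 and a core named "test_core" (hypothetical).
SOLR_UPDATE_URL = "http://localhost:8983/solr/test_core/update/json?commit=true"


def build_update_request(docs):
    """Return a POST request carrying docs as a JSON array (nothing is sent here)."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{"id": i, "foo_i": i, "bar_i": i, "foo_s": i, "bar_s": i} for i in range(1, 21)]
request = build_update_request(docs)
# To actually send it against a running Solr: urllib.request.urlopen(request)
```

Posting the single document with {{id}} 2 a second time through the same function would correspond to the overwrite step.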

After the Solr core was filled with documents, I added a single document once again, so that
it overwrote the previous version:
{noformat}
{'bar_s': 2, 'foo_i': 2, 'bar_i': 2, 'foo_s': 2, 'id': 2}
{noformat}

Now when I look at the "Overview" tab I see the following statistics:
{noformat}
Last Modified: less than a minute ago
Num Docs: 20
Max Doc: 21
Heap Memory Usage: -1
Deleted Docs: 1
Version: 7
Segment Count: 2
{noformat}
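
These numbers are what one would expect if an overwrite marks the old copy as deleted and appends the new one, with the deleted copy staying in the index until segments are merged. The toy model below only illustrates that bookkeeping; the class {{ToyIndex}} and its attributes are made up for the example and are not Solr or Lucene code:

```python
class ToyIndex:
    """Append-only list of [doc, is_live] slots, loosely modelled on an index."""

    def __init__(self):
        self.slots = []

    def add(self, doc):
        # Overwrite by unique key: mark any live copy as deleted, then append.
        for slot in self.slots:
            if slot[1] and slot[0]["id"] == doc["id"]:
                slot[1] = False
        self.slots.append([doc, True])

    @property
    def max_doc(self):
        # Counts every slot, including deleted ones.
        return len(self.slots)

    @property
    def num_docs(self):
        # Counts live slots only.
        return sum(1 for _, live in self.slots if live)

    @property
    def deleted_docs(self):
        return self.max_doc - self.num_docs


index = ToyIndex()
for i in range(1, 21):
    index.add({"id": i, "foo_s": i})
index.add({"id": 2, "foo_s": 2})  # re-add the document with id 2

print(index.num_docs, index.max_doc, index.deleted_docs)  # 20 21 1
```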

At this stage, all multi-select facet queries give incorrect results. Since every document
in the core has unique values for all fields, every facet query should return a count of {{1}}
for every value of every field. Simple facet queries return correct results:

query is {{q=\*:\*&rows=0&facet=true&facet.limit=1&facet.field=foo_s&facet.field=foo_i&facet.field=bar_s&facet.field=bar_i}}
response is
{noformat}
{
  "responseHeader":{"status":0,"QTime":1},
  "response":{"numFound":20,"start":0,"docs":[]},
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "foo_s":["1",1],
      "foo_i":["1",1],
      "bar_s":["1",1],
      "bar_i":["1",1]
    },
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
{noformat}
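
A response like this can be checked mechanically. The sketch below assumes the flat {{[value, count, value, count, ...]}} layout seen in the response above; the helper names {{facet_pairs}} and {{overcounted}} are hypothetical:

```python
def facet_pairs(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def overcounted(facet_fields, expected=1):
    """Return (field, value, count) for every facet count above `expected`."""
    return [
        (field, value, count)
        for field, flat in facet_fields.items()
        for value, count in facet_pairs(flat)
        if count > expected
    ]


# "facet_fields" section copied from the response above.
facet_fields = {
    "foo_s": ["1", 1],
    "foo_i": ["1", 1],
    "bar_s": ["1", 1],
    "bar_i": ["1", 1],
}

print(overcounted(facet_fields))  # []
```

An empty list means no value is counted more often than the data allows.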

And this is what we get for a multi-select facet query:

query is {{q=\*:\*&fq=\{!tag=a\}id:\*&rows=0&facet=true&facet.limit=1&facet.field=\{!ex=a\}foo_s&facet.field=\{!ex=a\}foo_i&facet.field=\{!ex=a\}bar_s&facet.field=\{!ex=a\}bar_i}}
response is
{noformat}
{
  "responseHeader":{"status":0,"QTime":2},
  "response":{"numFound":20,"start":0,"docs":[]},
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "foo_s":["2",2],
      "foo_i":["2",2],
      "bar_s":["2",2],
      "bar_i":["2",2]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
{noformat}
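
The query string above is easier to read when assembled from its parts. This is a sketch using only {{urllib.parse}}; the function name {{multi_select_params}} is hypothetical, and the parameter values are the ones from the query above:

```python
from urllib.parse import parse_qsl, urlencode

FIELDS = ["foo_s", "foo_i", "bar_s", "bar_i"]


def multi_select_params(fields, tag="a"):
    """Build the parameter list: a tagged filter plus facet fields that exclude it."""
    params = [
        ("q", "*:*"),
        ("fq", "{!tag=%s}id:*" % tag),  # filter query tagged with `tag`
        ("rows", "0"),
        ("facet", "true"),
        ("facet.limit", "1"),
    ]
    # Each facet field excludes the tagged filter via the {!ex=...} local param.
    params += [("facet.field", "{!ex=%s}%s" % (tag, field)) for field in fields]
    return params


query_string = urlencode(multi_select_params(FIELDS))
print(query_string)
```

A list of tuples is used instead of a dict because {{facet.field}} is repeated once per field.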

So we get count {{2}} for value {{"2"}}, i.e. the replaced (old) version of the document with
{{id=2}} is still counted when multi-select faceting is used.
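
To make the symptom concrete, here is a toy model of the counting. It is not Solr's faceting code and says nothing about where the actual defect lies; it only shows that counting over all index slots, without skipping the deleted copy, yields exactly the observed {{2}}. The names {{slots}} and {{facet_counts}} are made up for the illustration:

```python
from collections import Counter

# Index slots after the overwrite, as (doc, is_live) pairs.
slots = [({"id": i, "foo_s": str(i)}, True) for i in range(1, 21)]
slots[1] = (slots[1][0], False)                 # old copy of id 2, marked deleted
slots.append(({"id": 2, "foo_s": "2"}, True))   # new copy of id 2


def facet_counts(slots, field, respect_deletes):
    """Count field values, optionally skipping slots that are marked deleted."""
    counts = Counter()
    for doc, live in slots:
        if live or not respect_deletes:
            counts[doc[field]] += 1
    return counts


correct = facet_counts(slots, "foo_s", respect_deletes=True)
buggy = facet_counts(slots, "foo_s", respect_deletes=False)
print(correct["2"], buggy["2"])  # 1 2
```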




> Facet search count numbers are falsified by older document versions
> -------------------------------------------------------------------
>
>                 Key: SOLR-8496
>                 URL: https://issues.apache.org/jira/browse/SOLR-8496
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.4
>         Environment: Linux 3.16.0-4-amd64 x86_64 Debian 8.2
> openjdk-7-jre-headless:amd64   version 7u91-2.6.3-1~deb8u1
> solr-5.4.0, extracted from official tar
> Default solr settings from install script:
> SOLR_HEAP="512m"
> GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
>            Reporter: Andreas Müller
>
> Our setup is based on multiple cores. In one core we have a multi-valued field with integer
> values and some other unimportant fields. We're using multi-select faceting for this field.
> We're querying a test scenario with:
> {code}
> http://localhost:8983/solr/core-name/select?q=dummyask: (true) AND manufacturer: false AND id: (15039 16882 10850 20781)&fq={!tag=professions}professions: (59)&fl=id&wt=json&indent=true&facet=true&facet.field={!ex=professions}professions
> {code}
> - Query: (numDocs:48545, maxDoc:48545)
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">4</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> - Then we update one document and change some fields (numDocs:48545, maxDoc:48546). *The
> number of maxDocs is increased*
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">5</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> *The Problem:*
> In the first query, we're getting a facet count of 4, which is correct. After updating
> one document, we're getting 5 as a result, which is not correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

