lucene-solr-user mailing list archives

From Peter Lee <Peter....@proquest.com>
Subject Collapsing Query Parser returns one record per shard...was not expecting this...
Date Mon, 03 Aug 2015 20:20:06 GMT
From my reading of the Solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the
impression that these two methods (result grouping and the collapsing query parser) can both be
used to eliminate duplicates from a result set. In our case, we have a duplication field that
contains a 'signature' that identifies duplicates. We use our own signature for a variety
of reasons that are tied to complex business requirements.

In a test environment I scattered 15 duplicate records (along with 10 unique records) across
a SolrCloud test system (Solr version 5.2.1) that had 4 shards and a replication
factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates.
Result grouping worked as expected...the collapsing query parser did not.

With the collapsing query parser, Solr included in the result set one of the duplicate
records from each shard. That is, I received FOUR duplicate records, and turning on debug
showed that each of the four came from a different shard. I was expecting Solr to do the
collapsing on the aggregated result and return only ONE of the duplicated records across
ALL shards. It appears that Solr is performing the collapse on each individual shard, but
then NOT performing it again on the aggregated results from all shards.

I have searched through the forums and checked the documentation as carefully as I can. I
find no documentation or mention of this effect (one record being returned per shard) when
using the collapsing query parser.

Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter?
Am I simply not understanding correctly how this is supposed to work?

For reference, I am including below the search url and the response I received. Any insights
would be appreciated.

Query: http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true
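
For anyone wanting to reproduce the request, here is a minimal Python sketch that builds the
same query string with the filter query fully percent-encoded. The helper name
build_collapse_url is hypothetical; the host, collection, field and parameters are the ones
from the query above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Host and collection taken from the query above.
BASE_URL = "http://172.26.250.150:8983/solr/AcaColl/select"


def build_collapse_url(base_url: str, field: str, rows: int = 1000) -> str:
    """Build a select URL that collapses results on `field` (hypothetical helper)."""
    params = {
        "q": "*:*",
        "wt": "json",
        "indent": "true",
        "rows": str(rows),
        # The collapsing query parser is applied as a filter query.
        "fq": "{!collapse field=%s}" % field,
        "debugQuery": "true",
    }
    # urlencode percent-encodes the braces, '!' and the space in the fq value.
    return base_url + "?" + urlencode(params)


url = build_collapse_url(BASE_URL, "dupid_s")
print(url)

# Round-trip check: decoding the query string gives back the original fq.
print(parse_qs(urlsplit(url).query)["fq"])
```

Encoding the whole fq value avoids the mix of raw braces and %20 that appears in the
hand-typed URL.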

Response (note that dupid_s = 900 is the duplicate value, and that I have added comments of
the form ***<comment>*** to the output showing which shard each duplicate record came from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900", ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900", ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8019",
        "dupid_s":"2004",
        "title_pqth":["Unique Record #4"],
        "_version_":1508241005528219648,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"1003",
        "dupid_s":"900", ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8017",
        "dupid_s":"2002",
        "title_pqth":["Unique Record #2"],
        "_version_":1508241005518782464,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8018",
        "dupid_s":"2003",
        "title_pqth":["Unique Record #3"],
        "_version_":1508241005519831040,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"1001",
        "dupid_s":"900", ***AcaColl_shard3_replica1***
        "title_pqth":["Dupe Record #1"],
        "_version_":1508241005511442432,
        "indexTime_dt":"2015-07-31T19:25:09.912Z"},
      {
        "storeid_s":"8021",
        "dupid_s":"2006",
        "title_pqth":["Unique Record #6"],
        "_version_":1508241005532413952,
        "indexTime_dt":"2015-07-31T19:25:09.929Z"},
      {
        "storeid_s":"8022",
        "dupid_s":"2007",
        "title_pqth":["Unique Record #7"],
        "_version_":1508241005533462528,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"},
      {
        "storeid_s":"8015",
        "dupid_s":"2010",
        "title_pqth":["Unique Record #10"],
        "_version_":1508241005534511104,
        "indexTime_dt":"2015-07-31T19:25:09.938Z"}]
  },
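
A quick way to confirm which collapse keys still repeat in a response like the one above is
to tally the dupid_s values. This is only a sketch: the docs list below is hand-copied from
the response (storeid_s and dupid_s only), and the helper name repeated_keys is hypothetical.

```python
from collections import Counter

# (storeid_s, dupid_s) for each returned doc, in response order.
docs = [
    {"storeid_s": "1002", "dupid_s": "900"},
    {"storeid_s": "8020", "dupid_s": "2005"},
    {"storeid_s": "8023", "dupid_s": "2008"},
    {"storeid_s": "8024", "dupid_s": "2009"},
    {"storeid_s": "1007", "dupid_s": "900"},
    {"storeid_s": "8016", "dupid_s": "2001"},
    {"storeid_s": "8019", "dupid_s": "2004"},
    {"storeid_s": "1003", "dupid_s": "900"},
    {"storeid_s": "8017", "dupid_s": "2002"},
    {"storeid_s": "8018", "dupid_s": "2003"},
    {"storeid_s": "1001", "dupid_s": "900"},
    {"storeid_s": "8021", "dupid_s": "2006"},
    {"storeid_s": "8022", "dupid_s": "2007"},
    {"storeid_s": "8015", "dupid_s": "2010"},
]


def repeated_keys(docs, field="dupid_s"):
    """Return {value: count} for every field value that occurs more than once."""
    counts = Counter(doc[field] for doc in docs)
    return {value: n for value, n in counts.items() if n > 1}


print(len(docs))            # 14, matching numFound
print(repeated_keys(docs))  # {'900': 4}
```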


More background information:

The following lists show the StoreIDs (unique key values) present on each shard. The asterisked
StoreID is the duplicate that was returned in the response shown above. It is easy to see that
one duplicate record per shard was returned.
=Shard 1 StoreIDs=
*1003
1010
8017
8018

=Shard 2 StoreIDs=
*1002
1004
1005
1006
1011
1015
8020
8023
8024

=Shard 3 StoreIDs=
*1001
1008
1014
8015
8021
8022

=Shard 4 StoreIDs=
*1007
1009
1012
1013
8016
8019
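
To make the suspected behavior concrete, here is a small Python simulation of "collapse on
each shard, then merge without collapsing again", using the shard contents listed above. It
is a toy model of my hypothesis, not Solr's actual implementation. Assumptions: every 10xx
StoreID carries dupid_s = 900, every 80xx StoreID is its own group, and "collapse" simply
keeps the first document seen per group (Solr's real choice of group head depends on score
or sort).

```python
# StoreIDs per shard, copied from the lists above.
SHARDS = {
    "shard1": ["1003", "1010", "8017", "8018"],
    "shard2": ["1002", "1004", "1005", "1006", "1011", "1015",
               "8020", "8023", "8024"],
    "shard3": ["1001", "1008", "1014", "8015", "8021", "8022"],
    "shard4": ["1007", "1009", "1012", "1013", "8016", "8019"],
}


def dupid(store_id: str) -> str:
    # Assumption: 10xx records are the duplicates (dupid_s = 900);
    # each 80xx record gets a distinct placeholder key of its own.
    return "900" if store_id.startswith("10") else "unique-" + store_id


def collapse(store_ids):
    """Keep the first StoreID seen for each collapse key (toy group-head rule)."""
    heads = {}
    for sid in store_ids:
        heads.setdefault(dupid(sid), sid)
    return list(heads.values())


# Step 1: collapse independently on each shard, then concatenate the results.
per_shard = [sid for ids in SHARDS.values() for sid in collapse(ids)]
merged_dupes = [sid for sid in per_shard if dupid(sid) == "900"]

# Step 2 (what I expected to happen): collapse once more over the merged list.
globally = collapse(per_shard)

print(len(per_shard))   # 14 documents, the same as numFound in the response
print(merged_dupes)     # ['1003', '1002', '1001', '1007'], one per shard
print(len(globally))    # 11 documents: 10 unique records plus ONE duplicate
```

The per-shard-only model reproduces exactly the 14 documents and the four asterisked
duplicates I received, whereas a second collapse over the merged list would give the 11
documents I was expecting.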

Any relevant insights that can be offered would be appreciated...
