lucene-solr-user mailing list archives

From Peter Lee <>
Subject Collapsing Query Parser returns one record per shard...was not expecting this...
Date Mon, 03 Aug 2015 20:20:06 GMT
From my reading of the solr docs (e.g.
and, I've been under the
impression that these two methods (result grouping and the collapsing query parser) can both be
used to eliminate duplicates from a result set. In our case, we have a duplication field that
contains a 'signature' that identifies duplicates; we use our own signature for a variety
of reasons tied to complex business requirements.

In a test environment I scattered 15 duplicate records (with another 10 unique records) across
a test system running Solr Cloud (Solr version 5.2.1) that had 4 shards and a replication
factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates.
The result grouping worked as expected...the collapsing query parser did not.
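
For concreteness, the two approaches differ only in their request parameters. Below is a minimal Python sketch of how the two requests could be built. The base URL, the `q=*:*` query and the `select_url` helper are assumptions for illustration only (they are not the exact request I sent); `group`, `group.field` and the `{!collapse field=dupid_s}` filter query are the standard Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical base URL; host, port and collection path are assumptions.
BASE_URL = "http://localhost:8983/solr/AcaColl"

# Result grouping: one group per distinct dupid_s value.
grouping_params = {
    "q": "*:*",            # assumed match-all query
    "group": "true",
    "group.field": "dupid_s",
    "wt": "json",
}

# Collapsing query parser: applied as a filter query.
collapse_params = {
    "q": "*:*",            # assumed match-all query
    "fq": "{!collapse field=dupid_s}",
    "wt": "json",
}


def select_url(base, params):
    """Build a /select URL with properly escaped query parameters."""
    return base + "/select?" + urlencode(params)


grouping_url = select_url(BASE_URL, grouping_params)
collapse_url = select_url(BASE_URL, collapse_params)
```

Adding `debugQuery=true` to either parameter set is how I saw which shard each record came from.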

My results with the collapsing query parser showed that Solr was in fact including in
the result set one of the duplicate records from each shard (that is, I received FOUR duplicate
records, and turning on debug showed that each of the four came from a different shard), when
I was expecting Solr to do the collapsing on the aggregated result and return only ONE of
the duplicated records across ALL shards. It appears that Solr performs the collapse
on each individual shard, but does NOT perform it again on the aggregated
results from the shards.
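
To make the behaviour concrete, here is a small Python model of what I believe I am seeing. It is not Solr code: the `collapse` function, the round-robin placement of records on shards and the record ids are hypothetical stand-ins made up for illustration, and it assumes every record carries a `dupid_s` value (unique records get their own distinct value).

```python
def collapse(docs, field="dupid_s"):
    """Keep only the first document seen for each value of `field`."""
    seen = set()
    kept = []
    for doc in docs:
        key = doc[field]
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# 15 duplicates sharing dupid_s=900, plus 10 unique records (hypothetical ids).
docs = [{"id": f"dupe-{i}", "dupid_s": "900"} for i in range(1, 16)]
docs += [{"id": f"unique-{i}", "dupid_s": f"u{i}"} for i in range(1, 11)]

# Scatter the records across 4 shards (round-robin is an assumption).
NUM_SHARDS = 4
shards = [docs[i::NUM_SHARDS] for i in range(NUM_SHARDS)]

# What I observe: collapse on each shard, then merge with no second collapse.
merged_per_shard = [doc for shard in shards for doc in collapse(shard)]

# What I expected: a single collapse over the aggregated result.
collapsed_globally = collapse(docs)

dupes_observed = sum(1 for d in merged_per_shard if d["dupid_s"] == "900")
dupes_expected = sum(1 for d in collapsed_globally if d["dupid_s"] == "900")
print(dupes_observed, dupes_expected)
```

In this model the per-shard path keeps one duplicate per shard (four in total), while the global path keeps one, which matches the difference between what I received and what I expected.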

I have searched through the forums and checked the documentation as carefully as I can. I
find no documentation or mention of this effect (one record being returned per shard) when
using collapsing query parsing.

Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter?
Am I simply not understanding correctly how this is supposed to work?

For reference, I am including below the search url and the response I received. Any insights
would be appreciated.


Response (note that dupid_s = 900 is the duplicate value and that I have added comments in
the output ***<comment>*** pointing out which shard responses came from):

      "fq":"{!collapse field=dupid_s}",
        "dupid_s":"900", ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "title_pqth":["Unique Record #5"],
        "title_pqth":["Unique Record #8"],
        "title_pqth":["Unique Record #9"],
        "dupid_s":"900", ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "title_pqth":["Unique Record #1"],
        "title_pqth":["Unique Record #4"],
        "dupid_s":"900", ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "title_pqth":["Unique Record #2"],
        "title_pqth":["Unique Record #3"],
        "dupid_s":"900", ***AcaColl_shard3_replica1***
        "title_pqth":["Dupe Record #1"],
        "title_pqth":["Unique Record #6"],
        "title_pqth":["Unique Record #7"],
        "title_pqth":["Unique Record #10"],

More background information:

The following lists show the StoreIDs (unique key values) present on each shard. The asterisked
StoreID is the one that was returned in the response shown above. It is easy to see that one record
per shard was returned.
=Shard 1 StoreIDs=

=Shard 2 StoreIDs=

=Shard 3 StoreIDs=

=Shard 4 StoreIDs=

Any relevant insights that can be offered would be appreciated...
