lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: Collapsing Query Parser returns one record per shard...was not expecting this...
Date Tue, 04 Aug 2015 02:14:29 GMT
Your findings are the expected behavior for the Collapsing qparser. It
requires all documents that share a collapse field value to be located on
the same shard. The typical approach is to use compositeId routing so that
documents with the same collapse field value land on the same shard.
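
As a rough illustration of the routing idea, here is a minimal Python sketch. It assumes the collection uses the compositeId router and that the unique key (storeid_s in this thread) is rewritten as "<dupid_s>!<original id>". The helper names and the crc32-based shard assignment are hypothetical stand-ins for illustration only; Solr's real router uses its own hash function.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # matches the 4-shard test cluster described in the thread


def composite_id(collapse_value: str, doc_id: str) -> str:
    """Build a compositeId-style key: '<shard key>!<document id>'."""
    return f"{collapse_value}!{doc_id}"


def toy_shard_for(routed_id: str) -> int:
    """Hypothetical stand-in for the router: hash only the part before '!'.

    Solr's real compositeId router uses a different hash; crc32 is used
    here only to keep the toy deterministic.
    """
    shard_key = routed_id.split("!", 1)[0]
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_SHARDS


def collapse_per_shard(docs):
    """Collapse on dupid_s inside each shard, then concatenate the
    per-shard results without collapsing again across shards."""
    shards = defaultdict(dict)
    for doc in docs:
        shard = toy_shard_for(doc["storeid_s"])
        # Keep the first document seen for each dupid_s on this shard.
        shards[shard].setdefault(doc["dupid_s"], doc)
    return [doc for per_shard in shards.values() for doc in per_shard.values()]


# The four duplicates returned in the thread, re-keyed with the dupid_s prefix.
docs = [
    {"storeid_s": composite_id("900", store_id), "dupid_s": "900"}
    for store_id in ("1001", "1002", "1003", "1007")
]
# One of the unique records from the thread.
docs.append({"storeid_s": composite_id("2001", "8016"), "dupid_s": "2001"})

merged = collapse_per_shard(docs)
dupes = [d for d in merged if d["dupid_s"] == "900"]
print(len(dupes))  # prints 1
```

Because every document with dupid_s = 900 shares the same prefix, they all hash to one shard, so collapsing per shard is enough to leave a single representative in the merged result.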

We should make this clear in the documentation.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee <Peter.Lee@proquest.com> wrote:

> From my reading of the Solr docs (e.g.
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> and https://cwiki.apache.org/confluence/display/solr/Result+Grouping),
> I've been under the impression that these two methods (result grouping and
> the collapsing query parser) can both be used to eliminate duplicates from a
> result set. In our case, we have a duplication field that contains a
> 'signature' that identifies duplicates. We use our own signature for a
> variety of reasons that are tied to complex business requirements.
>
> In a test environment I scattered 15 duplicate records (with another 10
> unique records) across a test system running Solr Cloud (Solr version
> 5.2.1) that had 4 shards and a replication factor of 2. I tried both result
> grouping and the collapsing query parser to remove duplicates. The result
> grouping worked as expected...the collapsing query parser did not.
>
> My results with the collapsing query parser showed that Solr was in
> fact including in the result set one of the duplicate records from each
> shard (that is, I received FOUR duplicate records...and turning on debug
> showed that each of the four records came from a different shard)...when I
> was expecting Solr to do the collapsing on the aggregated result and return
> only ONE of the duplicated records across ALL shards. It appears that Solr
> is performing the collapse on each individual shard, but
> then NOT performing the operation on the aggregated results from all shards.
>
> I have searched through the forums and checked the documentation as
> carefully as I can. I find no documentation or mention of this effect (one
> record being returned per shard) when using collapsing query parsing.
>
> Is this a known behavior? Am I just doing something wrong? Am I missing
> some search parameter? Am I simply not understanding correctly how this is
> supposed to work?
>
> For reference, I am including below the search url and the response I
> received. Any insights would be appreciated.
>
> Query:
> http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true
>
> Response (note that dupid_s = 900 is the duplicate value and that I have
> added comments in the output, ***<comment>***, marking which shard each
> duplicate record came from):
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":31,
>     "params":{
>       "debugQuery":"true",
>       "indent":"true",
>       "q":"*:*",
>       "wt":"json",
>       "fq":"{!collapse field=dupid_s}",
>       "rows":"1000"}},
>   "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
>       {
>         "storeid_s":"1002",
>         "dupid_s":"900", ***AcaColl_shard2_replica2***
>         "title_pqth":["Dupe Record #2"],
>         "_version_":1508241005512491008,
>         "indexTime_dt":"2015-07-31T19:25:09.914Z"},
>       {
>         "storeid_s":"8020",
>         "dupid_s":"2005",
>         "title_pqth":["Unique Record #5"],
>         "_version_":1508241005539753984,
>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>       {
>         "storeid_s":"8023",
>         "dupid_s":"2008",
>         "title_pqth":["Unique Record #8"],
>         "_version_":1508241005540802560,
>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>       {
>         "storeid_s":"8024",
>         "dupid_s":"2009",
>         "title_pqth":["Unique Record #9"],
>         "_version_":1508241005541851136,
>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>       {
>         "storeid_s":"1007",
>         "dupid_s":"900", ***AcaColl_shard4_replica2***
>         "title_pqth":["Dupe Record #7"],
>         "_version_":1508241005515636736,
>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>       {
>         "storeid_s":"8016",
>         "dupid_s":"2001",
>         "title_pqth":["Unique Record #1"],
>         "_version_":1508241005526122496,
>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>       {
>         "storeid_s":"8019",
>         "dupid_s":"2004",
>         "title_pqth":["Unique Record #4"],
>         "_version_":1508241005528219648,
>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>       {
>         "storeid_s":"1003",
>         "dupid_s":"900", ***AcaColl_shard1_replica1***
>         "title_pqth":["Dupe Record #3"],
>         "_version_":1508241005515636736,
>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>       {
>         "storeid_s":"8017",
>         "dupid_s":"2002",
>         "title_pqth":["Unique Record #2"],
>         "_version_":1508241005518782464,
>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>       {
>         "storeid_s":"8018",
>         "dupid_s":"2003",
>         "title_pqth":["Unique Record #3"],
>         "_version_":1508241005519831040,
>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>       {
>         "storeid_s":"1001",
>         "dupid_s":"900", ***AcaColl_shard3_replica1***
>         "title_pqth":["Dupe Record #1"],
>         "_version_":1508241005511442432,
>         "indexTime_dt":"2015-07-31T19:25:09.912Z"},
>       {
>         "storeid_s":"8021",
>         "dupid_s":"2006",
>         "title_pqth":["Unique Record #6"],
>         "_version_":1508241005532413952,
>         "indexTime_dt":"2015-07-31T19:25:09.929Z"},
>       {
>         "storeid_s":"8022",
>         "dupid_s":"2007",
>         "title_pqth":["Unique Record #7"],
>         "_version_":1508241005533462528,
>         "indexTime_dt":"2015-07-31T19:25:09.938Z"},
>       {
>         "storeid_s":"8015",
>         "dupid_s":"2010",
>         "title_pqth":["Unique Record #10"],
>         "_version_":1508241005534511104,
>         "indexTime_dt":"2015-07-31T19:25:09.938Z"}]
>   },
>
>
> More background information:
>
> The following lists show the StoreIDs (unique key values) present on each
> shard. The asterisked StoreID is the one that was returned in the response
> shown above. It is easy to see that one record per shard was returned.
> =Shard 1 StoreIDs=
> *1003
> 1010
> 8017
> 8018
>
> =Shard 2 StoreIDs=
> *1002
> 1004
> 1005
> 1006
> 1011
> 1015
> 8020
> 8023
> 8024
>
> =Shard 3 StoreIDs=
> *1001
> 1008
> 1014
> 8015
> 8021
> 8022
>
> =Shard 4 StoreIDs=
> *1007
> 1009
> 1012
> 1013
> 8016
> 8019
>
> Any relevant insights that can be offered would be appreciated...
>
