lucene-solr-user mailing list archives

From Peter Lee <Peter....@proquest.com>
Subject RE: Collapsing Query Parser returns one record per shard...was not expecting this...
Date Tue, 04 Aug 2015 12:58:24 GMT
Joel,

Thank you for that information.

I had not heard of composite ID routing, and found a post (by you) on the feature that was
most instructive (https://lucidworks.com/blog/solr-cloud-document-routing/).
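
For anyone else following along, here is a minimal SolrJ sketch of what composite ID routing
looks like at index time, as I understand it from that post. It assumes a uniqueKey field named
"id" plus the collection and field names from this thread; the ZooKeeper address and values are
illustrative. The point is that prefixing the id with the dupe value and "!" makes Solr hash on
the prefix, so every document sharing that prefix lands on the same shard.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeRoutingSketch {
    public static void main(String[] args) throws Exception {
        // SolrJ 5.x style: point the client at ZooKeeper and pick the collection.
        CloudSolrClient client = new CloudSolrClient("localhost:9983");
        client.setDefaultCollection("AcaColl");

        // "900!1002" and "900!1007" share the routing prefix "900!", so both
        // documents are routed to the same shard and can be collapsed there.
        SolrInputDocument d1 = new SolrInputDocument();
        d1.addField("id", "900!1002");
        d1.addField("dupid_s", "900");
        d1.addField("title_pqth", "Dupe Record #2");

        SolrInputDocument d2 = new SolrInputDocument();
        d2.addField("id", "900!1007");
        d2.addField("dupid_s", "900");
        d2.addField("title_pqth", "Dupe Record #7");

        client.add(d1);
        client.add(d2);
        client.commit();
        client.close();
    }
}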

Thanks for clearing up the behavior of the collapsing query parser. Sadly, I doubt co-locating
the records is going to be possible for us. The dupe field we use to eliminate duplicates
at search time can CHANGE over time in a way that is not really predictable, because it is
derived from the other content in the system. If we went that route, we'd have to put in place
a complex mechanism to move/relocate records after they've been indexed, and I don't think
that is going to be the solution for us.
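
Just to make concrete why that would be painful for us: with composite ID routing the prefix is
baked into the document id, so if the dupe value for an already-indexed record changes, the
record has to be deleted under its old routed id and re-added under a new one. A rough sketch
(ids and values hypothetical):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RelocateOnDupeChange {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("localhost:9983");
        client.setDefaultCollection("AcaColl");

        // The dupe value for this record changed from 900 to 901, so the old
        // routed id has to go away...
        client.deleteById("900!1002");

        // ...and the record is re-added under the new prefix so it co-locates
        // with the other documents that now share dupid 901.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "901!1002");
        doc.addField("dupid_s", "901");
        doc.addField("title_pqth", "Dupe Record #2");
        client.add(doc);

        client.commit();
        client.close();
    }
}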

On another note, I've been away from Solr since version 4.2 and am now returning to version
5.2.1. Back in the day, grouping gave us a HORRIBLE performance hit, even after we spent a
lot of time trying to tune the system for it. From what we are seeing in testing now, grouping
performance has been greatly improved. I know it is and always will be a computationally
intensive task, but it is good news that it has seen such performance improvements.
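
For reference, here is roughly how the two approaches we compared look as SolrJ queries (a
sketch; the collection and field names are from our test, everything else is illustrative). As
Joel notes below, group.ngroups is the grouping option that needs documents collocated by the
grouping field, and the collapse filter collapses within each shard.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DedupeQueryStyles {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("localhost:9983");
        client.setDefaultCollection("AcaColl");

        // Result grouping: one group per dupid_s; group.ngroups=true is the
        // option that needs documents collocated by the grouping field to be
        // accurate.
        SolrQuery grouping = new SolrQuery("*:*");
        grouping.set("group", "true");
        grouping.set("group.field", "dupid_s");
        grouping.set("group.ngroups", "true");
        grouping.setRows(1000);
        QueryResponse groupRsp = client.query(grouping);

        // Collapsing query parser: same dedupe intent, expressed as a filter;
        // it collapses within each shard, which is why duplicates must be
        // routed to the same shard.
        SolrQuery collapsing = new SolrQuery("*:*");
        collapsing.addFilterQuery("{!collapse field=dupid_s}");
        collapsing.setRows(1000);
        QueryResponse collapseRsp = client.query(collapsing);

        client.close();
    }
}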

Again, thanks for the information and for the heads up regarding composite id routing. I'll
have to take a closer look at that feature and see if we can take advantage of it in some
way.

Thank you.

-----Original Message-----
From: Joel Bernstein [mailto:joelsolr@gmail.com] 
Sent: Monday, August 03, 2015 10:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Collapsing Query Parser returns one record per shard...was not expecting this...

One of the things to keep in mind with Grouping is that if you are relying on an accurate group
count (ngroups), then you will also have to collocate documents based on the grouping field.

The main advantage of the Collapsing qparser plugin is that it provides fast field collapsing on
high-cardinality fields with an accurate group count.

If you don't need ngroups, then Grouping is usually just as fast if not faster.


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 10:14 PM, Joel Bernstein <joelsolr@gmail.com> wrote:

> Your findings are the expected behavior for the Collapsing qparser. 
> The Collapsing qparser requires records in the same collapsed field to 
> be located on the same shard. The typical approach for this is to use 
> composite Id routing to ensure that documents with the same collapse 
> field land on the same shard.
>
> We should make this clear in the documentation.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee <Peter.Lee@proquest.com> wrote:
>
>> From my reading of the solr docs (e.g.
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+
>> Results and 
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping),
>> I've been under the impression that these two methods (result 
>> grouping and collapsing query parser) can both be used to eliminate 
>> duplicates from a result set (in our case, we have a duplication 
>> field that contains a 'signature' that identifies duplicates. We use 
>> our own signature for a variety of reasons that are tied to complex business requirements.).
>>
>> In a test environment I scattered 15 duplicate records (with another 
>> 10 unique records) across a test system running Solr Cloud (Solr 
>> version
>> 5.2.1) that had 4 shards and a replication factor of 2. I tried both 
>> result grouping and the collapsing query parser to remove duplicates. 
>> The result grouping worked as expected...the collapsing query parser did not.
>>
>> My results in using the collapsing query parser showed that Solr was 
>> in fact including into the result set one of the duplicate records 
>> from each shard (that is, I received FOUR duplicate records...and 
>> turning on debug showed that each of the four records came from a  
>> unique shard)...when I was expecting solr to do the collapsing on the 
>> aggregated result and return only ONE of the duplicated records 
>> across ALL shards. It appears that solr is performing the collapsing 
>> query parsing on each individual shard, but then NOT performing the operation on
>> the aggregated results from each shard.
>>
>> I have searched through the forums and checked the documentation as 
>> carefully as I can. I find no documentation or mention of this effect 
>> (one record being returned per shard) when using collapsing query parsing.
>>
>> Is this a known behavior? Am I just doing something wrong? Am I 
>> missing some search parameter? Am I simply not understanding 
>> correctly how this is supposed to work?
>>
>> For reference, I am including below the search url and the response I 
>> received. Any insights would be appreciated.
>>
>> Query:
>> http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true
>>
>> Response (note that dupid_s = 900 is the duplicate value and that I 
>> have added comments in the output ***<comment>*** pointing out which 
>> shard responses came from):
>>
>> {
>>   "responseHeader":{
>>     "status":0,
>>     "QTime":31,
>>     "params":{
>>       "debugQuery":"true",
>>       "indent":"true",
>>       "q":"*:*",
>>       "wt":"json",
>>       "fq":"{!collapse field=dupid_s}",
>>       "rows":"1000"}},
>>   "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
>>       {
>>         "storeid_s":"1002",
>>         "dupid_s":"900", ***AcaColl_shard2_replica2***
>>         "title_pqth":["Dupe Record #2"],
>>         "_version_":1508241005512491008,
>>         "indexTime_dt":"2015-07-31T19:25:09.914Z"},
>>       {
>>         "storeid_s":"8020",
>>         "dupid_s":"2005",
>>         "title_pqth":["Unique Record #5"],
>>         "_version_":1508241005539753984,
>>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>>       {
>>         "storeid_s":"8023",
>>         "dupid_s":"2008",
>>         "title_pqth":["Unique Record #8"],
>>         "_version_":1508241005540802560,
>>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>>       {
>>         "storeid_s":"8024",
>>         "dupid_s":"2009",
>>         "title_pqth":["Unique Record #9"],
>>         "_version_":1508241005541851136,
>>         "indexTime_dt":"2015-07-31T19:25:09.94Z"},
>>       {
>>         "storeid_s":"1007",
>>         "dupid_s":"900", ***AcaColl_shard4_replica2***
>>         "title_pqth":["Dupe Record #7"],
>>         "_version_":1508241005515636736,
>>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>>       {
>>         "storeid_s":"8016",
>>         "dupid_s":"2001",
>>         "title_pqth":["Unique Record #1"],
>>         "_version_":1508241005526122496,
>>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>>       {
>>         "storeid_s":"8019",
>>         "dupid_s":"2004",
>>         "title_pqth":["Unique Record #4"],
>>         "_version_":1508241005528219648,
>>         "indexTime_dt":"2015-07-31T19:25:09.91Z"},
>>       {
>>         "storeid_s":"1003",
>>         "dupid_s":"900", ***AcaColl_shard1_replica1***
>>         "title_pqth":["Dupe Record #3"],
>>         "_version_":1508241005515636736,
>>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>>       {
>>         "storeid_s":"8017",
>>         "dupid_s":"2002",
>>         "title_pqth":["Unique Record #2"],
>>         "_version_":1508241005518782464,
>>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>>       {
>>         "storeid_s":"8018",
>>         "dupid_s":"2003",
>>         "title_pqth":["Unique Record #3"],
>>         "_version_":1508241005519831040,
>>         "indexTime_dt":"2015-07-31T19:25:09.917Z"},
>>       {
>>         "storeid_s":"1001",
>>         "dupid_s":"900", ***AcaColl_shard3_replica1***
>>         "title_pqth":["Dupe Record #1"],
>>         "_version_":1508241005511442432,
>>         "indexTime_dt":"2015-07-31T19:25:09.912Z"},
>>       {
>>         "storeid_s":"8021",
>>         "dupid_s":"2006",
>>         "title_pqth":["Unique Record #6"],
>>         "_version_":1508241005532413952,
>>         "indexTime_dt":"2015-07-31T19:25:09.929Z"},
>>       {
>>         "storeid_s":"8022",
>>         "dupid_s":"2007",
>>         "title_pqth":["Unique Record #7"],
>>         "_version_":1508241005533462528,
>>         "indexTime_dt":"2015-07-31T19:25:09.938Z"},
>>       {
>>         "storeid_s":"8015",
>>         "dupid_s":"2010",
>>         "title_pqth":["Unique Record #10"],
>>         "_version_":1508241005534511104,
>>         "indexTime_dt":"2015-07-31T19:25:09.938Z"}]
>>   },
>>
>>
>> More background information:
>>
>> The following lists show the StoreIDs (unique key values) present on 
>> each shard. The asterisked StoreID is the one that was returned in 
>> the response shown above. Easy to see that one record per shard was returned.
>> =Shard 1 StoreIDs=
>> *1003
>> 1010
>> 8017
>> 8018
>>
>> =Shard 2 StoreIDs=
>> *1002
>> 1004
>> 1005
>> 1006
>> 1011
>> 1015
>> 8020
>> 8023
>> 8024
>>
>> = Shard 3 StoreIDs=
>> *1001
>> 1008
>> 1014
>> 8015
>> 8021
>> 8022
>>
>> = Shard 4 StoreIDs=
>> *1007
>> 1009
>> 1012
>> 1013
>> 8016
>> 8019
>>
>> Any relevant insights that can be offered would be appreciated...
>>
>
>