lucene-solr-user mailing list archives

From Matteo Grolla <matteo.gro...@gmail.com>
Subject Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Date Sat, 30 May 2015 13:04:58 GMT
Wow,
	thanks to both of you for the suggestions.

Erick: good point about the uneven shard load.
	I'm not worried about the growth of a particular shard; in that case I'd use shard splitting and, if necessary, add a server to the cluster.
	But even if I manage to spread the docs of a type A producer evenly across the cluster, I could still have an uneven query distribution (the two problems are very similar):
		at time t one shard could be queried by 11 type A producers while another shard is queried by a single type A producer, which is not ideal.
	So I could use a few bits (0 or 1) of the composite id for type A producers' docs to avoid those kinds of problems.
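
As a concrete illustration of that idea, here is a minimal sketch of building such route keys; the `producer/bits!doc` syntax is Solr's compositeId format, while the helper name and the example ids are made up:

```python
def route_key(producer_id, doc_id, producer_bits=None):
    # Default compositeId: producer!doc, where the producer part pins
    # the high 16 bits of the routing hash. With an explicit bit count,
    # producer/N!doc pins only the top N bits, so that producer's docs
    # spread over a larger slice of the cluster.
    prefix = producer_id if producer_bits is None else f"{producer_id}/{producer_bits}"
    return f"{prefix}!{doc_id}"

print(route_key("producerA1", "doc42"))      # producerA1!doc42
print(route_key("producerA1", "doc42", 1))   # producerA1/1!doc42
```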

For type B and type C producers the problems discussed above seem unlikely, so I'd like to weigh the pros and cons of sharding on userid.
pros
	I reduce the size of the problem: instead of searching across the whole repository, I search only a part of it.
cons
	I could have an uneven distribution of documents and queries across the cluster (unlikely: there are lots of type B and type C users).
	Docs for one user aren't searched in parallel across multiple shards.
		This could be useful if one user produces so many docs that he benefits from sharding (which should happen only for type A).

I think the pro is appealing: under these hypotheses, if type B and C users increase, I can scale the system without many concerns.

Do you agree? 


On 29 May 2015, at 20:18, Reitzel, Charles wrote:

> Thanks, Erick.   I appreciate the sanity check.
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com] 
> Sent: Thursday, May 28, 2015 5:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: optimal shard assignment with low shard key cardinality using compositeId
to enable shard splitting
> 
> Charles:
> 
> You raise good points, and I didn't mean to say that co-locating docs by some criterion is never a good idea. That said, it does add administrative complexity that I'd prefer to avoid unless necessary.
> 
> I suppose it largely depends on what the load and response SLAs are.
> If there's 1 query/second peak load, the sharding overhead for queries is probably not
noticeable. If there are 1,000 QPS, then it might be worth it.
> 
> Measure, measure, measure......
> 
> I think your composite ID understanding is fine.
> 
> Best,
> Erick
> 
> On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles <Charles.Reitzel@tiaa-cref.org>
wrote:
>> We have used a similar sharding strategy for exactly the reasons you say. But we are fairly certain that the # of documents per user ID is < 5,000 and, typically, < 500. Thus, we think the overhead of distributed searches clearly outweighs the benefits. Would you agree? We have done some load testing (with hundreds of simultaneous users) and performance has been good with data and queries distributed evenly across shards.
>> 
>> In Matteo's case, this model appears to apply well to user types B and C.    Not
sure about user type A, though.    At > 100,000 docs per user per year, on average, that
load seems ok for one node.   But, is it enough to benefit significantly from a parallel search?
>> 
>> With a 2-part composite ID, each part contributes 16 bits to a 32-bit hash value, which is then compared to the set of hash ranges for the active shards. Since the user ID contributes the high-order 16 bits, it will dominate in matching the target shard(s). But dominance doesn't mean the low-order 16 bits will always be ignored, does it? I.e., if the original shard has been split, perhaps multiple times, isn't it possible that one user ID's documents will be spread over multiple shards?
>> 
>> In Matteo's case, it might make sense to allocate fewer bits to the user ID for user category A. I.e., what I described above is the default for userId!docId. But if you use userId/8!docId (8 bits for the user ID and 24 bits for the document ID), then couldn't one user's docs be spread over multiple shards, even without splitting?
>> 
>> I'm just making sure I understand how composite ID sharding works correctly.   Have
I got it right?  Has any of this logic changed in 5.x?
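
A rough way to check that understanding in code: the sketch below mimics the 2-part hash composition described above. It is only an illustration, not Solr's implementation; Solr uses MurmurHash3, with md5 standing in here, and the 16-shard ring partitioned by the top 4 hash bits is an assumption.

```python
import hashlib

def h32(s):
    # Stand-in 32-bit hash (Solr actually uses MurmurHash3).
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

def route_hash(user, doc, user_bits=16):
    # The user part supplies the top user_bits of the 32-bit hash,
    # the doc part fills in the remaining low bits.
    mask = (0xFFFFFFFF << (32 - user_bits)) & 0xFFFFFFFF
    return (h32(user) & mask) | (h32(doc) & ~mask & 0xFFFFFFFF)

def shard(h):
    # 16 shards partition the ring by the top 4 bits of the hash.
    return h >> 28

# Default 16/16 split: all of one user's docs share the high 16 bits,
# so on 16 shards they land on exactly one shard.
default = {shard(route_hash("userX", f"doc{i}")) for i in range(1000)}
print(len(default))   # 1

# If the user pins fewer bits than the shard ranges use (e.g. userX/2!...),
# the doc part reaches into the shard-selecting bits and one user's docs
# can span several shards even without any splits.
spread = {shard(route_hash("userX", f"doc{i}", user_bits=2)) for i in range(1000)}
print(len(spread) > 1)   # True
```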
>> 
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Thursday, May 21, 2015 11:30 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: optimal shard assignment with low shard key cardinality 
>> using compositeId to enable shard splitting
>> 
>> I question your base assumption:
>> 
>> bq: So shard by document producer seems a good choice
>> 
>> Because what this _also_ does is force all of the work for a query onto one node
and all indexing for a particular producer ditto. And will cause you to manually monitor your
shards to see if some of them grow out of proportion to others. And....
>> 
>> I think it would be much less hassle to just let Solr distribute the docs as it sees fit based on the uniqueKey and forget about it, unless you want, say, to do joins etc. There will, of course, be some overhead that you pay here, but unless you can measure it and it's a pain, I wouldn't add the complexity you're talking about, especially at the volumes you're talking about.
>> 
>> Best,
>> Erick
>> 
>> On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla <matteo.grolla@gmail.com> wrote:
>>> Hi
>>> I'd like some feedback on how I'd like to solve the following 
>>> sharding problem
>>> 
>>> 
>>> I have a collection that will eventually become big
>>> 
>>> Average document size is 1.5kb
>>> Every year 30 Million documents will be indexed
>>> 
>>> Data come from different document producers (a person, owner of his
>>> documents) and queries are almost always performed by a document 
>>> producer who can only query his own document. So shard by document 
>>> producer seems a good choice
>>> 
>>> there are 3 types of doc producer:
>>> type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
>>> type B: cardinality ~10k, producing 4M docs/year
>>> type C: cardinality ~10M, producing 9M docs/year
>>> 
>>> I'm thinking of using compositeId ( solrDocId = producerId!docId ) to send all docs of the same producer to the same shard. When a shard becomes too large I can use shard splitting.
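
A query for a single producer can then be pinned to that producer's shard(s) with the standard SolrCloud `_route_` parameter. The helper below is only a sketch of building such a request URL; the host, collection name, and function name are illustrative, not from this thread:

```python
from urllib.parse import urlencode

def producer_query_url(base, collection, producer_id, q="*:*"):
    # Appending "!" to the producer id routes the query to the shard(s)
    # owning that producer's hash range, mirroring the producerId!docId
    # key used at indexing time.
    params = urlencode({"q": q, "_route_": f"{producer_id}!"})
    return f"{base}/solr/{collection}/select?{params}"

print(producer_query_url("http://localhost:8983", "docs", "producerA1"))
```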
>>> 
>>> problems
>>> - documents from type A producers could be oddly distributed among shards, because hashing doesn't balance well over so few keys (105); see the Appendix
>>> 
>>> As a solution I could do this when a new type A producer (producerA1) arrives:
>>> 
>>> 1) client app: generate a producer code
>>> 2) client app: simulate murmur hashing and shard assignment
>>> 3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers); otherwise go to 1) and try another code
>>> 
>>> when I add documents or perform searches for producerA1, I use its producer code, respectively, in the compositeId or in the _route_ parameter. What do you think?
>>> 
>>> 
>>> -----------Appendix: murmurhash shard assignment
>>> simulation-----------------------
>>> 
>>> import mmh3
>>> 
>>> # hash each key and keep the high 16 bits, roughly as Solr's
>>> # compositeId routing does
>>> hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]
>>> 
>>> num_shards = 16
>>> shards = [0] * num_shards
>>> 
>>> for h in hashes:
>>>     idx = h % num_shards
>>>     shards[idx] += 1
>>> 
>>> print(shards)
>>> print(sum(shards))
>>> 
>>> -------------
>>> 
>>> result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
>>> 
>>> so with 16 shards and 105 shard keys I can end up with one shard holding a single key and another holding 11 keys
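
The three-step workaround sketched earlier in the message (generate a code, simulate the hash, accept it only if it lands on the least-loaded shard) could look like this; md5 stands in for MurmurHash3 and every name here is made up for illustration:

```python
import hashlib
import itertools

NUM_SHARDS = 16
shard_load = [0] * NUM_SHARDS   # type A producers currently on each shard
_candidates = itertools.count() # never reuse a candidate code

def shard_of(code):
    # Stand-in routing hash: top 4 bits of a 32-bit hash pick 1 of 16 shards.
    h = int(hashlib.md5(code.encode()).hexdigest()[:8], 16)
    return h >> 28

def assign_producer_code(prefix="prodA"):
    # Try fresh candidate codes until one hashes to a shard that
    # currently holds the fewest type A producers.
    target = min(shard_load)
    while True:
        code = f"{prefix}-{next(_candidates)}"
        s = shard_of(code)
        if shard_load[s] == target:
            shard_load[s] += 1
            return code

for _ in range(105):
    assign_producer_code()

print(shard_load)
# Every code was accepted only on a least-loaded shard, so loads can
# never differ by more than 1 (compare with the 1-to-11 spread above).
print(max(shard_load) - min(shard_load) <= 1)   # True
```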
>>> 
>> 
>> **********************************************************************
>> *** This e-mail may contain confidential or privileged information.
>> If you are not the intended recipient, please notify the sender immediately and then
delete it.
>> 
>> TIAA-CREF
>> **********************************************************************
>> ***
> 

