lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr 3.5 and sharding
Date Thu, 17 Jan 2013 23:43:45 GMT
Hmmm, maybe I'm finally getting it.

Right, that does seem odd. I would expect you to get 4x the number of
docs on any particular shard/replica in this situation.

What happens when you look at the Solr logs for each partition? You should
be able to glean the number of results from the logs. I guess there are a
couple of possibilities (one quick way to check is sketched after this list):
1> each machine actually returns N documents, but the aggregator does
something weird and gives you < 4X, indicating something's peculiar
with the Solr aggregation.
2> you find that, for some reason, you aren't getting the same count
_at the server level_, indicating your assertion that "all the indexes
are identical" isn't valid.
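
For instance, hitting each Solr instance directly with rows=0 and comparing the
numFound values would tell these two cases apart quickly (host, port and core names
below are placeholders, adjust them to your setup):

  http://machineB:8983/solr/part1/select?q=java&rows=0
  http://machineB:8983/solr/part2/select?q=java&rows=0
  http://machineB:8983/solr/part3/select?q=java&rows=0
  http://machineB:8983/solr/part4/select?q=java&rows=0

If every partition reports numFound=1000 but the distributed query reports less, the
discrepancy is in the aggregation step (case 1>); if the per-partition counts already
differ, it's case 2>.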

All of which means I'm pretty much out of ideas; it's hunt-and-seek time.

Erick

On Thu, Jan 17, 2013 at 10:53 AM, Jean-Sebastien Vachon
<jean-sebastien.vachon@wantedanalytics.com> wrote:
> Hi Erick,
>
> It looks like we are saying the exact same thing but with different terms ;)
> I looked at the Solr glossary and you might be right... maybe I should talk about
> partitions instead of shards.
>
> Since my last message, I've configured the replication between the master and slave,
> and everything is working fine except for my original question about the number of
> documents not matching my expectations.
>
> I'll try to clarify a few things and come back to this question...
>
> Machine A (which I called the master node) is where the indexation takes place.
> It consists of four Solr instances that will (eventually) each contain 1/4 of the
> entire collection. It's just that, at the moment, since I have no control over which
> partition a given document is sent to, I made copies of the same index for all
> partitions. Each Solr instance has a replication handler configured. I will eventually
> get to the point of changing the indexation code to distribute documents evenly
> across all partitions, but the person who can give me access to that code is not
> available right now, so I can do nothing about it for the moment.
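>
> To give you an idea, each instance on Machine A has something along these lines in
> its solrconfig.xml (a minimal sketch of the standard Solr 3.x master-side replication
> handler; the confFiles value is just illustrative):
>
>   <requestHandler name="/replication" class="solr.ReplicationHandler">
>     <lst name="master">
>       <str name="replicateAfter">commit</str>
>       <str name="replicateAfter">startup</str>
>       <str name="confFiles">schema.xml,stopwords.txt</str>
>     </lst>
>   </requestHandler>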
>
> Machine B has the same four shards set up to be replicas of the corresponding shards
> on machine A.
> Machine B also contains another Solr instance with the default handler configured to
> use the four local partitions. This instance receives the clients' requests, collects
> the results from each partition and then selects the best matches to form the final
> response. We intend to add new slaves that are exact copies of Machine B and to load
> balance client requests across all slaves.
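>
> Concretely, the aggregator instance on Machine B has the four partitions wired into
> the defaults of its search handler, roughly like this (host and core names are
> placeholders, not our real ones):
>
>   <requestHandler name="/select" class="solr.SearchHandler">
>     <lst name="defaults">
>       <str name="shards">machineB:8983/solr/part1,machineB:8983/solr/part2,machineB:8983/solr/part3,machineB:8983/solr/part4</str>
>     </lst>
>   </requestHandler>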
>
> My original question was: if each partition has 1000 documents matching a certain
> keyword, and I know all partitions have the same content, then I was expecting to
> receive 4 * 1000 documents for that keyword. But that is not the case.
> The replication is not an issue here, since the same request on the master node
> gives me the same result.
>
> Each shard, when queried individually, returns 1000 documents. But when I query them
> using the shards=xxx parameter, I get a little less than 4000 documents. I was just
> curious to know why this is happening... Is it a bug? Or something I am misunderstanding?
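>
> For illustration, whether the shards list comes from the handler defaults or is passed
> explicitly, the distributed request I'm talking about boils down to something like this
> (host and core names are placeholders for our real ones):
>
>   http://machineB:8983/solr/select?q=java&rows=0&shards=machineB:8983/solr/part1,machineB:8983/solr/part2,machineB:8983/solr/part3,machineB:8983/solr/part4
>
> Each part1..part4, queried directly, reports numFound=1000, but this aggregated request
> reports a bit under 4000.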
>
> Thanks for your time and contribution to Solr!
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: January-17-13 8:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 3.5 and sharding
>
> You're still confusing shards (or at least mixing up the terminology) with simple
> replication. Shards are when you split up the index into several sub-indexes and
> configure the sub-indexes to "know about each other". Say you have 1M docs in 2 shards.
> 500K of them would go on one shard and 500K on the other. But logically you have a
> single index of 1M docs. So the two shards have to know about each other, and when you
> send a request to one of them, it automatically queries the other (as well as itself),
> collects the responses and combines them, returning the top N to the requester.
>
> This is totally different from replication. In replication (master/slave), each node
> has all 1M documents. Each node can work totally in isolation. An incoming request is
> handled by the slave without contacting any other node.
>
> If you're copying around indexes AND configuring them as though they were shards,
> each request will be distributed to all shards and the results collated, giving you
> the same doc repeatedly in your result set.
>
> If you have no access to the indexing code, you really can't go to a sharded setup.
>
> Polling is when the slaves periodically ask the master "has anything changed?" If so,
> the slave pulls down the changes. The polling interval is configured in solrconfig.xml
> _on the slave_. So let's say you index docs to the master. For some interval, until the
> slaves poll the master and get an updated index, the number of searchable docs on the
> master will be different than on the slaves. Additionally, you may have the issue of
> the polling intervals for the slaves being offset from one another, so for some brief
> interval the counts on the slaves may differ from one another as well.
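>
> For reference, the slave side of this is just a small block in solrconfig.xml (a
> minimal sketch; the master URL and the 60-second interval are example values):
>
>   <requestHandler name="/replication" class="solr.ReplicationHandler">
>     <lst name="slave">
>       <str name="masterUrl">http://machineA:8983/solr/part1/replication</str>
>       <str name="pollInterval">00:00:60</str>
>     </lst>
>   </requestHandler>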
>
> Best
> Erick
>
> On Tue, Jan 15, 2013 at 10:18 AM, Jean-Sebastien Vachon
> <jean-sebastien.vachon@wantedanalytics.com> wrote:
>> Ok, I see what Erick meant now... Thanks.
>>
>> The original index I'm working on contains about 120k documents. Since I have no
>> access to the code that pushes documents into the index, I made four copies of the
>> same index.
>>
>> The master node contains no data at all; it simply uses the data available in its
>> four shards. Knowing that I have 1000 documents matching the keyword "java" on each
>> shard, I was expecting to receive 4000 documents out of my sharded setup. There are
>> only a few documents that are not accounted for (the result count is about 3996,
>> which is pretty close but not exact).
>>
>> Right now, the index is static, so there is no need for any replication and the
>> polling interval has no effect.
>> Later this week, I will configure the replication and have the indexation modified
>> to distribute the documents to each shard using a simple ID modulo 4 rule.
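>>
>> Something along these lines on the indexing side (only a sketch; it assumes a numeric
>> document ID, the Solr 3.x SolrJ client, and placeholder partition URLs):
>>
>>   import org.apache.solr.client.solrj.SolrServer;
>>   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>   import org.apache.solr.common.SolrInputDocument;
>>
>>   public class ModuloRouter {
>>       private final SolrServer[] partitions;
>>
>>       public ModuloRouter(String[] urls) throws Exception {
>>           // one SolrJ client per partition, e.g. http://machineA:8983/solr/part1
>>           partitions = new SolrServer[urls.length];
>>           for (int i = 0; i < urls.length; i++) {
>>               partitions[i] = new CommonsHttpSolrServer(urls[i]);
>>           }
>>       }
>>
>>       public void add(long id, SolrInputDocument doc) throws Exception {
>>           // simple ID modulo 4 rule: a given id always lands on the same partition
>>           partitions[(int) (id % partitions.length)].add(doc);
>>       }
>>   }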
>>
>> Were my expectations wrong about the number of documents?
>>
>> -----Original Message-----
>> From: Upayavira [mailto:uv@odoko.co.uk]
>> Sent: January-15-13 9:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr 3.5 and sharding
>>
>> He was referring to a master/slave setup, where a slave will poll the master
>> periodically, asking for index updates. That frequency is configured in
>> solrconfig.xml on the slave.
>>
>> So, you are saying that you have, say, 1m documents in your master index.
>> You then copy your index to four other boxes. At that point you have 1m documents
>> on each of those four. Eventually, you'll delete some docs, so you'd have 250k on
>> each. You're wondering why, before the deletes, you're not seeing 1m docs on each
>> of your instances.
>>
>> Or are you wondering why you're not seeing 1m docs when you do a distributed query
>> across all four of these boxes?
>>
>> Is that correct?
>>
>> Upayavira
>>
>> On Tue, Jan 15, 2013, at 02:11 PM, Jean-Sebastien Vachon wrote:
>>> Hi Erick,
>>>
>>> Thanks for your comments, but I am migrating an existing index (single
>>> instance) to a sharded setup, and currently I have no access to the
>>> code involved in the indexation process. That's why I made a simple
>>> copy of the index on each shard.
>>>
>>> In the end, the data will be distributed among all shards.
>>>
>>> I was just curious to know why I didn't have the expected number of
>>> documents with my four shards.
>>>
>>> Can you elaborate on this "polling interval" thing? I am pretty sure
>>> I have never heard about it...
>>>
>>> Regards
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>>> Sent: January-15-13 8:00 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr 3.5 and sharding
>>>
>>> You're confusing shards and slaves here. Shards are splitting a
>>> logical index amongst N machines, where each machine contains a
>>> portion of the index. In that setup, you have to configure the slaves
>>> to know about the other shards, and the incoming query has to be
>>> distributed amongst all the shards to find all the docs.
>>>
>>> In your case, since you're really replicating (rather than sharding),
>>> you only have to query _one_ slave; the query doesn't need to be distributed.
>>>
>>> So the place I'd start: pull all the sharding stuff out of your config
>>> files, put a load balancer in front of your slaves, and send each
>>> request to just one of them.
>>>
>>> Also, don't be at all surprised if the number of hits from the
>>> _master_ (which you shouldn't be searching, BTW) is different from
>>> the slaves'; there's the polling interval to consider.
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Mon, Jan 14, 2013 at 9:58 AM, Jean-Sebastien Vachon <
>>> jean-sebastien.vachon@wantedanalytics.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm setting up a small Solr setup consisting of 1 master node and 4
>>> > shards. For now, all four shards contain the exact same data. When
>>> > I perform a query on each individual shard for the word "java", I
>>> > receive the same number of docs (as expected). However, when I
>>> > go through the master node using the shards parameter, the
>>> > number of results is slightly off by a few documents. There is
>>> > nothing special in my setup, so I'm looking for hints on why I am
>>> > getting this problem.
>>> >
>>> > Thanks
>>> >
>>>
