lucene-solr-user mailing list archives

From Sudhakar Maddineni <maddineni...@gmail.com>
Subject Re: SolrCloud - ClusterState says we are the leader,but locally ...
Date Fri, 07 Dec 2012 23:55:09 GMT
We saw this error again today during our load test - basically, whenever the
session expires on the leader node, we see the error. After this happens, the
leader (001) goes into 'recovery' mode and all the index updates fail with a
"503 - Service Unavailable" error message. After some time (once recovery is
successful), the roles are swapped, i.e. 001 acting as the replica and 003 as
leader.

Btw, do you know why the connection to zookeeper [solr->zk] is getting
interrupted in the middle?
Is it because of the load (number of updates) we are putting on the cluster?
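One knob worth checking here is the ZooKeeper client session timeout: if heavy indexing causes long GC or I/O pauses, the solr->zk heartbeat can stall past the timeout and the session expires. A hedged sketch of raising it in solr.xml (the 4.0 default is around 15 s; the surrounding attributes and the 30000 value are illustrative, not our actual config):

```xml
<!-- solr.xml sketch: zkClientTimeout is the ZK session timeout in ms.
     Raising it lets the node survive longer pauses before its session
     expires and a leader election is triggered. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8080"
         zkClientTimeout="${zkClientTimeout:30000}">
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```

The timeout trades failover speed for stability: a larger value keeps slow nodes from being prematurely declared dead, but a genuinely dead leader takes longer to detect.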

Here is the exception stack trace:

Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
	at java.lang.Thread.run(Unknown Source)

Thx, Sudhakar.



On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni
<maddinenidev@gmail.com>wrote:

> Erick:
>   Not seeing any page caching related issues...
>
> Mark:
>   1. Would this "waiting" on 003 (replica) cause any inconsistencies in the
> zookeeper cluster state? I was also looking at the leader (001) logs at that
> time and seeing errors related to "SEVERE: ClusterState says we are the
> leader, but locally we don't think so".
>   2. Also, all of our servers in the cluster went down while the index
> updates were running in parallel with this issue. Do you see this as related
> to the session expiry on 001?
>
> Here are the logs on 001
> =========================
>
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
> amILeader
> WARNING:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
> amILeader
> INFO: According to ZK I
> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
> run
> WARNING: Overseer cannot talk to ZK
>
> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem making a request to the
> leader:org.apache.solr.common.SolrException: I was asked to wait on state
> down for <001>:8080_solr but I still do not see the request state. I see
> state: active live:true
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem making a request to the
> leader:org.apache.solr.common.SolrException: I was asked to wait on state
> down for <001>:8080_solr but I still do not see the request state. I see
> state: active live:true
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> ....
>  ....
> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> ....
>  ....
> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding
> the leader in zk
>  at
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> at
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> SEVERE: Error getting leader from zk
> org.apache.solr.common.SolrException: There is conflicting information
> about the leader of shard: shard1 our state says: http://<001>:8080/solr/core1/
> but zookeeper says: http://<003>:8080/solr/core1/
>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>  at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> INFO: Running the leader process.
>  ....
> ....
>
> Thanks for your inputs.
> Sudhakar.
>
>
>
>
>
>
>
> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>> Yes - it means that 001 went down (or, more likely, had its connection to
>> ZooKeeper interrupted; that's what I mean about a session timeout - if the
>> solr->zk link is broken for longer than the session timeout, that will
>> trigger a leader election, and when the connection is reestablished, the
>> node will have to recover). That waiting should have stopped as soon as 001
>> came back up or reconnected to ZooKeeper.
>>
>> In fact, this waiting should not happen in this case - but only on
>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>> soon!):
>>
>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
>> path
>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>
>> - Mark
>>
>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddinenidev@gmail.com>
>> wrote:
>>
>> > Yep, after restarting, the cluster came back to a normal state. We will
>> run a couple more tests and see if we can reproduce this issue.
>> >
>> > Btw, I am attaching the server logs from before that 'INFO: Waiting until
>> we see more replicas' message. From the logs, we can see that the leader
>> election process started on 003, which was initially the replica for 001.
>> Does that mean leader 001 went down at that time?
>> >
>> > logs on 003:
>> > ========
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: Running the leader process.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> >         INFO: Checking if I should try and be the leader.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> >         INFO: My last published State was Active, it's okay to be the
>> leader.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: I may be the new leader - try and sync
>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>> >         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
>> core=core1.
>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>> >         INFO: Updating live nodes -> this message is on 002
>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>>  exception talking to <001>:8080/solr/core1/, failed
>> >         org.apache.solr.client.solrj.SolrServerException: Timeout
>> occured while waiting response from server at: <001>:8080/solr/core1
>> >               at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>> >               at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >               at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>> >               at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>> Source)
>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>> >               at
>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>> Source)
>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>> >               at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>> >               at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >               at java.lang.Thread.run(Unknown Source)
>> >         Caused by: java.net.SocketTimeoutException: Read timed out
>> >               at java.net.SocketInputStream.socketRead0(Native Method)
>> >               at java.net.SocketInputStream.read(Unknown Source)
>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
>> sync failed
>> > 12:11:46 PM org.apache.solr.common.SolrException log
>> >         SEVERE: Sync Failed
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> rejoinLeaderElection
>> >         INFO: There is a better leader candidate than us - going back
>> into recovery
>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>> >         INFO: Running recovery - first canceling any ongoing recovery
>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>> >         INFO: Starting recovery process.  core=core1
>> recoveringAfterStartup=false
>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
>> core=core1 - recoveringAfterStartup=false
>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: Running the leader process.
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=179999
>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=179495
>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=178985
>> > ....
>> > ....
>> >
>> > Thanks for your help.
>> > Sudhakar.
>> >
>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmiller@gmail.com>
>> wrote:
>> > The waiting logging had to happen on restart unless it's some kind of
>> bug.
>> >
>> > Beyond that, something is off, but I have no clue why - it seems your
>> clusterstate.json is not up to date at all.
>> >
>> > Have you tried restarting the cluster then? Does that help at all?
>> >
>> > Do you see any exceptions around zookeeper session timeouts?
>> >
>> > - Mark
>> >
>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddinenidev@gmail.com>
>> wrote:
>> >
>> > > Hey Mark,
>> > >
>> > > Yes, I am able to access all of the nodes under each shard from
>> solrcloud
>> > > admin UI.
>> > >
>> > >
>> > >   - It kind of looks like the urls solrcloud is using are not
>> accessible.
>> > >   When you go to the admin page and the cloud tab, can you access the
>> urls it
>> > >   shows for each shard? That is, if you click one of the links or copy
>> and
>> > >   paste the address into a web browser, does it work?
>> > >
>> > > Actually, I got these errors when my document upload task/job was
>> running, not during the cluster restart. Also, the job ran fine for the
>> first hour and then started throwing these errors after indexing some docx.
>> > >
>> > > Thx, Sudhakar.
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmiller@gmail.com>
>> wrote:
>> > >
>> > >> It kind of looks like the urls solrcloud is using are not
>> accessible. When
>> > >> you go to the admin page and the cloud tab, can you access the urls
>> it
>> > >> shows for each shard? That is, if you click one of the links or copy
>> and
>> > >> paste the address into a web browser, does it work?
>> > >>
>> > >> You may have to explicitly set the host= in solr.xml if it's not auto
>> > >> detecting the right one. Make sure the ports look right too.
>> > >>
>> > >>> waitForReplicasToComeUp
>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> > >>> timeoutin=179999
>> > >>
>> > >> That happens when you stop the cluster and try to start it again -
>> before
>> > >> a leader is chosen, it will wait for all known replicas for a shard
>> to come
>> > >> up so that everyone can sync up and have a chance to be the best
>> leader. So
>> > >> at this point it was only finding one of 2 known replicas and
>> waiting for
>> > >> the second to come up. After a couple minutes (configurable) it will
>> just
>> continue anyway without the missing replica (if it doesn't show up).
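>> > >> The couple-minutes wait Mark mentions corresponds to the
>> > >> leaderVoteWait setting on <cores> in solr.xml (its ~180 s default
>> > >> matches the timeoutin=179999 countdown in the logs). A hedged sketch -
>> > >> the surrounding attributes are illustrative, not our actual config:

```xml
<!-- solr.xml sketch (Solr 4.x): leaderVoteWait is how long, in ms, a
     prospective leader waits for the other known replicas of its shard
     to show up before electing itself without them. 180000 is the
     default, shown here explicitly. -->
<cores adminPath="/admin/cores" leaderVoteWait="180000">
  <core name="core1" instanceDir="core1" />
</cores>
```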
>> > >>
>> > >> - Mark
>> > >>
>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
>> maddinenidev@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> Hi,
>> > >>> We are uploading solr documents to the index in batches using 30
>> threads
>> > >>> and a ThreadPoolExecutor with a LinkedBlockingQueue whose max limit
>> is set to
>> > >>> 10000.
>> > >>> In the code, we are using HttpSolrServer and the add(inputDoc) method
>> to add
>> > >>> docx.
>> > >>> And, we have the following commit settings in solrconfig:
>> > >>>
>> > >>>    <autoCommit>
>> > >>>      <maxTime>300000</maxTime>
>> > >>>      <maxDocs>10000</maxDocs>
>> > >>>      <openSearcher>false</openSearcher>
>> > >>>    </autoCommit>
>> > >>>
>> > >>>      <autoSoftCommit>
>> > >>>        <maxTime>1000</maxTime>
>> > >>>      </autoSoftCommit>
>> > >>>
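>> > >>> The batching pattern described above can be sketched roughly as
>> > >>> follows - a simplified illustration in Python rather than the
>> > >>> SolrJ/Java code actually used; send_batch, the batch size, and the
>> > >>> thread count are stand-ins, and commits are left to the server-side
>> > >>> autoCommit/autoSoftCommit settings above:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(docs, batch_size):
    """Split docs into fixed-size batches; the last batch may be smaller."""
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

def upload_all(docs, send_batch, batch_size=100, threads=30):
    """Send batches concurrently. send_batch is a stand-in for a call
    like HttpSolrServer.add(batch); no explicit commit is issued, since
    the server's autoCommit/autoSoftCommit handle flushing."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(send_batch, chunk(docs, batch_size)))
```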
>> > >>> Cluster Details:
>> > >>> ----------------------------
>> > >>> solr version - 4.0
>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>> > >>> numshards=2 ,
>> > >>> 001, 002, 003 are the solr nodes and these three are behind the
>> > >>> loadbalancer  <vip>
>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
>> > >>>
>> > >>>
>> > >>> Logs: getting the errors in the sequence below, after uploading some
>> docx:
>> > >>>
>> > >>
>> -----------------------------------------------------------------------------------------------------------
>> > >>> 003
>> > >>> Dec 4, 2012 12:11:46 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> > >>> waitForReplicasToComeUp
>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> > >>> timeoutin=179999
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:12:59 PM
>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
>> > >>> doDefensiveChecks
>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
>> think
>> > >> so
>> > >>>
>> > >>> 003
>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
>> Server at
>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
>> Unavailable
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> > >>> 001
>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error while trying to recover.
>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
>> leader
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error uploading:
>> > >> org.apache.solr.client.solrj.SolrServerException:
>> > >>> IOException occured when talking to server at <vip>/solr/core1
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> > >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>> > >>> ... 5 lines omitted ...
>> > >>> at java.lang.Thread.run(Unknown Source)
>> > >>> Caused by: java.net.SocketException: Connection reset
>> > >>>
>> > >>>
>> > >>> After some time, all three servers went down.
>> > >>>
>> > >>> Appreciate, if someone could let us know what we are missing.
>> > >>>
>> > >>> Thx,Sudhakar.
>> > >>
>> > >>
>> >
>> >
>> > <logs_error.txt>
>>
>>
>
