lucene-solr-user mailing list archives

From Robert Douglas <rld...@cornell.edu>
Subject Re: Could not publish that recovery failed
Date Mon, 06 Apr 2020 12:57:49 GMT
Thanks Erick.

I think we already looked at GC and the Solr logs and nothing jumped out, but I'll let you
know if we get to the bottom of this.

On 4/3/20, 7:20 PM, "Erick Erickson" <erickerickson@gmail.com> wrote:

    Hmmmm. What this usually means is that the connection from the Solr instance to Zookeeper
    somehow times out. The first thing I’d be looking at are my GC logs, both on my Solr instances
    and my Zookeeper instances. If you have excessive stop-the-world times (15 seconds?) then
    that’d be the first thing I’d look at.
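
A minimal sketch of that GC check, assuming a Java 8 style GC log written with -XX:+PrintGCApplicationStoppedTime (both flags appear in the config quoted further down). It pulls out every "Total time for which application threads were stopped" line and reports the pauses at or above a threshold. The function names, the regular expression, and the 5-second threshold are illustrative assumptions, not anything taken from Solr itself; a sensible threshold is some fraction of zkClientTimeout (15000 ms in this setup).

```python
import re

# Assumed line shape (Java 8, -XX:+PrintGCApplicationStoppedTime):
#   ...: Total time for which application threads were stopped: 0.0012345 seconds, ...
STOPPED = re.compile(
    r"Total time for which application threads were stopped: (\d+(?:\.\d+)?) seconds"
)


def long_pauses(lines, threshold_seconds):
    """Return (line_number, pause_seconds) for each pause >= threshold_seconds."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue  # not a stopped-time line
        seconds = float(match.group(1))
        if seconds >= threshold_seconds:
            hits.append((number, seconds))
    return hits


def scan_gc_log(path, threshold_seconds=5.0):
    """Print every long pause found in the GC log at `path` (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        for number, seconds in long_pauses(handle, threshold_seconds):
            print(f"{path}:{number}: threads stopped for {seconds:.3f}s")
```

Calling scan_gc_log("/cul/data/solr/logs/solr_gc.log") would list candidate pauses to compare against the timestamps of the session-expired errors. The same check is worth running against the Zookeeper GC logs, if GC logging is enabled there.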
    
    But I’ve seen these errors come on at various times that aren’t caused by GC, and I’ve never
    quite known where to start determining the cause. It becomes lots of detective work.
    
    Oh, and be sure to look three places:
    - the Zookeeper logs (besides GC)
    - the Solr log on the leader of the shard with the replica that fails to recover
    - the Solr log on the node that’s failing to recover.
    
    
    Best,
    Erick
    
    > On Apr 3, 2020, at 11:52 AM, Robbie Douglas <rld244@cornell.edu> wrote:
    > 
    > Hello,
    > 
    > We had an outage on one of our Solr nodes that we are trying to figure out.
    > Here's what came up in the Solr admin logs. 3 separate ones that I think
    > were in this order, but maybe not.
    > 
    > Stopping recovery for core=[b1_shard5_replica_n16]
    > coreNodeName=[core_node19]
    > 
    > Error while trying to recover.
    > core=b1_shard5_replica_n16:org.apache.solr.common.SolrException: Error while
    > saving shard term for collection: b1
    >         at
    > org.apache.solr.cloud.ZkShardTerms.saveTerms(ZkShardTerms.java:307)
    >         at
    > org.apache.solr.cloud.ZkShardTerms.forceSaveTerms(ZkShardTerms.java:281)
    >         at
    > org.apache.solr.cloud.ZkShardTerms.startRecovering(ZkShardTerms.java:227)
    >         at
    > org.apache.solr.cloud.ZkController.publish(ZkController.java:1576)
    >         at
    > org.apache.solr.cloud.ZkController.publish(ZkController.java:1500)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:577)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:326)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:307)
    >         at
    > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
    >         at
    > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    >         at
    > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    >         at
    > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    >         at
    > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    >         at java.lang.Thread.run(Thread.java:745)
    > 
    > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
    > KeeperErrorCode = Session expired for /collections/b1/terms/shard5
    >         at
    > org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
    >         at
    > org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    >         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1336)
    >         at
    > org.apache.solr.common.cloud.SolrZkClient.lambda$setData$6(SolrZkClient.java:370)
    >         at
    > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
    >         at
    > org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:370)
    >         at
    > org.apache.solr.cloud.ZkShardTerms.saveTerms(ZkShardTerms.java:297)
    >         ... 14 more
    > 
    > Could not publish that recovery
    > failed:org.apache.zookeeper.KeeperException$SessionExpiredException:
    > KeeperErrorCode = Session expired for /overseer/queue
    >         at
    > org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
    >         at
    > org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1111)
    >         at
    > org.apache.solr.common.cloud.SolrZkClient.lambda$exists$2(SolrZkClient.java:322)
    >         at
    > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
    >         at
    > org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:322)
    >         at
    > org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:309)
    >         at
    > org.apache.solr.cloud.ZkController.publish(ZkController.java:1587)
    >         at
    > org.apache.solr.cloud.ZkController.publish(ZkController.java:1500)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.recoveryFailed(RecoveryStrategy.java:190)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:715)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:326)
    >         at
    > org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:307)
    >         at
    > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
    >         at
    > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    >         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    >         at
    > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    >         at
    > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    >         at
    > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    >         at java.lang.Thread.run(Thread.java:745)
    > 
    > 
    > Solr is 8.1.1 with Zookeeper 3.4.9 deployed on the same nodes.
    > 
    > Solr config looks like this.
    > 
    > -DSTOP.KEY=solrrocks
    > -DSTOP.PORT=7983
    > -Dcom.sun.management.jmxremote
    > -Dcom.sun.management.jmxremote.authenticate=false
    > -Dcom.sun.management.jmxremote.local.only=false
    > -Dcom.sun.management.jmxremote.port=18983
    > -Dcom.sun.management.jmxremote.rmi.port=18983
    > -Dcom.sun.management.jmxremote.ssl=false
    > -Djetty.home=/cul/app/solr/solr/server
    > -Djetty.port=8983
    > -Dlog4j.configurationFile=file:/cul/data/solr/log4j2.xml
    > -Dsolr.data.home=
    > -Dsolr.default.confdir=/cul/app/solr/solr/server/solr/configsets/_default/conf
    > -Dsolr.install.dir=/cul/app/solr/solr
    > -Dsolr.jetty.https.port=8983
    > -Dsolr.log.dir=/cul/data/solr/logs
    > -Dsolr.log.muteconsole
    > -Dsolr.solr.home=/cul/data/solr/data
    > -Duser.timezone=UTC
    > -DzkClientTimeout=15000
    > -DzkHost=zk-host1:2181, zk-host2:2181, zk-host3:2181
    > -XX:+AlwaysPreTouch
    > -XX:+ParallelRefProcEnabled
    > -XX:+PerfDisableSharedMem
    > -XX:+PrintGCApplicationStoppedTime
    > -XX:+PrintGCDateStamps
    > -XX:+PrintGCDetails
    > -XX:+PrintGCTimeStamps
    > -XX:+PrintHeapAtGC
    > -XX:+PrintTenuringDistribution
    > -XX:+UseG1GC
    > -XX:+UseGCLogFileRotation
    > -XX:+UseLargePages
    > -XX:GCLogFileSize=20M
    > -XX:MaxGCPauseMillis=250
    > -XX:NumberOfGCLogFiles=9
    > -XX:OnOutOfMemoryError=/cul/app/solr/solr/bin/oom_solr.sh 8983 /cul/data/solr/logs
    > -Xloggc:/cul/data/solr/logs/solr_gc.log
    > -Xms8g
    > -Xmx8g
    > -Xss256k
    > -verbose:gc
    > 
    > 
    > Any ideas on what to keep an eye on that would cause this would be greatly
    > appreciated.
    > 
    > Thanks,
    > Robbie
    > 
    > 
    > 
    > 
    > 
    > --
    > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
    
    
