lucene-dev mailing list archives

From Jérémie MONSINJON (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-8129) HdfsChaosMonkeyNothingIsSafeTest failures
Date Wed, 27 Jan 2016 18:09:40 GMT

    [ https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119905#comment-15119905 ]

Jérémie MONSINJON commented on SOLR-8129:
-----------------------------------------

I’m stressing a Solr cluster with Gatling for performance tests: 3 shards, 3 nodes per shard (1 leader, 2 replicas).
The load is about 300 updates per second through CloudSolrServer (roughly 100 per shard) plus 40 qps of select queries, round-robined across the nodes.
After about 2 minutes of running the scenario, it seems that the leader of a shard can no longer send an update to one of its replicas.
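
For reference, the client side of the scenario boils down to something like this (a minimal SolrJ 5.x sketch, not the actual Gatling simulation; the ZooKeeper ensemble and node names are the ones visible in the logs below, the document fields are made up):
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LoadTestSketch {
    public static void main(String[] args) throws Exception {
        // Updates go through the ZooKeeper-aware client so they are routed to the shard leaders.
        CloudSolrServer updateClient =
                new CloudSolrServer("solrNode013:2181,solrNode014:2181,solrNode015:2181");
        updateClient.setDefaultCollection("offers_full");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "offer-1");          // hypothetical document, real payload differs
        updateClient.add(doc);                  // ~300 docs/s in the Gatling scenario

        // Select queries are round-robined across the nodes with the load-balancing client.
        LBHttpSolrServer queryClient = new LBHttpSolrServer(
                "http://solrNode027:8983/solr/offers_full",
                "http://solrNode033:8983/solr/offers_full",
                "http://solrNode051:8983/solr/offers_full");
        queryClient.query(new SolrQuery("*:*"));    // ~40 qps in the scenario

        updateClient.shutdown();
        queryClient.shutdown();
    }
}
{code}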

REPLICA : 51
{quote}
2016-01-27 15:18:00.252 WARN  (qtp1057941451-128982) [   ] o.e.j.h.HttpParser badMessage:
java.lang.IllegalStateException: too much data after closed for HttpChannelOverHttp@428e80f{r=1098,c=false,a=IDLE,uri=-}
{quote}

LEADER : 33
{quote}
2016-01-27 15:18:00.245 ERROR (updateExecutor-2-thread-3557-processing-http:////solrNode051:8983//solr//offers_full_shard3_replica1
x:offers_full_shard3_replica3 r:core_node10 n:solrNode033:8983_solr s:shard3 c:offers_full)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.u.StreamingSolrClients
error
org.apache.http.NoHttpResponseException: solrNode051:8983 failed to respond
2016-01-27 15:18:00.267 INFO  (qtp1057941451-45551) [c:offers_full s:shard3 r:core_node10
x:offers_full_shard3_replica3] o.a.s.c.ZkController Put replica core=offers_full_shard3_replica1
coreNodeName=core_node3 on solrNode051:8983_solr into leader-initiated recovery.
{quote}

REPLICA : 51
{quote}
2016-01-27 15:18:00.276 INFO  (qtp1057941451-129018) [   ] o.a.s.h.a.CoreAdminHandler It has
been requested that we recover: core=offers_full_shard3_replica1
2016-01-27 15:18:00.277 INFO  (Thread-117036) [c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1]
o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing recovery

2016-01-27 15:18:08.296 INFO  (updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.c.RecoveryStrategy Attempting to PeerSync
from http://solrNode033:8983/solr/offers_full_shard3_replica3/ - recoveringAfterStartup=false

2016-01-27 15:18:08.296 ERROR (updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr ERROR, update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=tlog{file=/var/lib/solr5/data/offers_full_shard3_replica1/data/tlog/tlog.0000000000000004550
refcount=1}}
2016-01-27 15:18:08.297 WARN  (updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.u.PeerSync PeerSync: core=offers_full_shard3_replica1
url=http://solrNode051:8983/solr too many updates received since start - startingUpdates no
longer overlaps with our currentUpdates

2016-01-27 15:18:15.160 WARN  (updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.h.IndexFetcher File _d2r_1wk.liv did not
match. expected checksum is 3620017595 and actual is checksum 442433238. expected length is
38405 and actual length is 38405
2016-01-27 15:18:15.160 INFO  (updateExecutor-2-thread-4594-processing-n:solrNode051:8983_solr
x:offers_full_shard3_replica1 s:shard3 c:offers_full r:core_node3) [c:offers_full s:shard3
r:core_node3 x:offers_full_shard3_replica1] o.a.s.h.IndexFetcher Starting download (fullCopy=true)
to NRTCachingDirectory(MMapDirectory@/var/lib/solr5/data/offers_full_shard3_replica1/data/index.20160127151815143
lockFactory=org.apache.lucene.store.NativeFSLockFactory@721552d5; maxCacheMB=48.0 maxMergeSizeMB=4.0)
{quote}

REPLICA : 27
{quote}
2016-01-27 15:20:54.009 INFO  (zkCallback-3-thread-198-processing-n:solrNode027:8983_solr)
[   ] o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes, has occurred - updating... (live nodes size: 11)
2016-01-27 15:20:54.016 INFO  (zkCallback-3-thread-198-processing-n:solrNode027:8983_solr)
[c:offers_full s:shard3 r:core_node7 x:offers_full_shard3_replica2] o.a.s.c.ShardLeaderElectionContext
Running the leader process for shard shard3
{quote}

REPLICA : 51
{quote}
2016-01-27 15:20:54.011 INFO  (zkCallback-3-thread-110-processing-n:solrNode051:8983_solr)
[c:offers_full s:shard3 r:core_node3 x:offers_full_shard3_replica1] o.a.s.u.DefaultSolrCoreState
Running recovery - first canceling any ongoing recovery
{quote}	

REPLICA : 27
{quote}
BECAME THE LEADER
{quote}
{quote}
2016-01-27 15:20:56.525 INFO  (zkCallback-3-thread-198-processing-n:solrNode027:8983_solr)
[c:offers_full s:shard3 r:core_node7 x:offers_full_shard3_replica2] o.a.s.c.SyncStrategy Sync
Success - now sync replicas to me

2016-01-27 15:21:08.793 ERROR (qtp1057941451-43187) [c:offers_full s:shard3 r:core_node7 x:offers_full_shard3_replica2]
o.a.s.u.p.DistributedUpdateProcessor Request says it is coming from leader, but we are the
leader: update.distrib=FROMLEADER&distrib.from=http://solrNode033:8983/solr/offers_full_shard3_replica3/&wt=javabin&version=2
{quote}	

(OLD) LEADER : 33
{quote}
2016-01-27 15:21:08.801 ERROR (qtp1057941451-45871) [c:offers_full s:shard3 r:core_node10
x:offers_full_shard3_replica3] o.a.s.u.p.DistributedUpdateProcessor On core_node10, replica
http://solrNode027:8983/solr/offers_full_shard3_replica2/ now thinks it is the leader! Failing
the request to let the client retry! org.apache.solr.common.SolrException:Service Unavailable

2016-01-27 15:21:09.712 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[   ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. Attempting to
reconnect to recover relationship with ZooKeeper...

2016-01-27 15:21:09.747 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ZkController
publishing state=down
2016-01-27 15:21:09.760 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ZkController
Replica core_node10 NOT in leader-initiated recovery, need to wait for leader to see down
state.
{quote}

(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.137 INFO  (zkCallback-3-thread-199-processing-n:solrNode027:8983_solr)
[   ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/collections/offers_full/state.json for collection offers_full has occurred - updating...
(live nodes size: 10)
2016-01-27 15:21:10.139 INFO  (zkCallback-3-thread-199-processing-n:solrNode027:8983_solr)
[   ] o.a.s.c.c.ZkStateReader Updating data for offers_full from 1344 to 1345
2016-01-27 15:21:10.215 INFO  (qtp1057941451-43418) [   ] o.a.s.h.a.CoreAdminHandler Going
to wait for coreNodeName: core_node10, state: down, checkLive: null, onlyIfLeader: null, onlyIfLeaderActive:
null
{quote}

(OLD) LEADER : 33
{quote}
2016-01-27 15:21:10.220 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ElectionContext
cancelElection did not find election node to remove /overseer_elect/election/238989758703863355-solrNode033:8983_solr-n_0000000158
2016-01-27 15:21:10.221 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.LeaderElector
Joined leadership election with path: /overseer_elect/election/238989758703863364-solrNode033:8983_solr-n_0000000175
2016-01-27 15:21:10.238 INFO  (zkCallback-3-thread-36-processing-n:solrNode033:8983_solr-EventThread)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ZkController
Register node as live in ZooKeeper:/live_nodes/solrNode033:8983_solr
2016-01-27 15:21:10.290 INFO  (updateExecutor-2-thread-3542-processing-n:solrNode033:8983_solr
x:offers_full_shard3_replica3 s:shard3 c:offers_full r:core_node10) [c:offers_full s:shard3
r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.RecoveryStrategy Sending prep recovery
command to http://solrNode027:8983/solr; WaitForState: action=PREPRECOVERY&core=offers_full_shard3_replica2&nodeName=solrNode033:8983_solr&coreNodeName=core_node10&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
{quote}	

(NEW) LEADER : 27
{quote}
2016-01-27 15:21:10.329 INFO  (qtp1057941451-43519) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null
path=/admin/cores params={nodeName=solrNode033:8983_solr&onlyIfLeaderActive=true&core=offers_full_shard3_replica2&coreNodeName=core_node10&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2}
status=0 QTime=2
{quote}	

(OLD) LEADER : 33
{quote}
Process Recovering
{quote}
{quote}
2016-01-27 15:21:52.007 INFO  (zkCallback-3-thread-211-processing-n:solrNode033:8983_solr)
[   ] o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes, has occurred - updating... (live nodes size: 11)
2016-01-27 15:21:52.008 INFO  (zkCallback-3-thread-210-processing-n:solrNode033:8983_solr)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ActionThrottle
The last leader attempt started 22366010ms ago.
2016-01-27 15:21:52.008 INFO  (zkCallback-3-thread-210-processing-n:solrNode033:8983_solr)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ShardLeaderElectionContext
Running the leader process for shard shard3
2016-01-27 15:21:52.007 INFO  (zkCallback-3-thread-211-processing-n:solrNode033:8983_solr)
[   ] o.a.s.c.c.ZkStateReader A live node change: WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes, has occurred - updating... (live nodes size: 11)
2016-01-27 15:21:52.011 INFO  (zkCallback-3-thread-210-processing-n:solrNode033:8983_solr)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ShardLeaderElectionContext
Checking if I should try and be the leader.
2016-01-27 15:21:52.011 INFO  (zkCallback-3-thread-210-processing-n:solrNode033:8983_solr)
[c:offers_full s:shard3 r:core_node10 x:offers_full_shard3_replica3] o.a.s.c.ShardLeaderElectionContext
There may be a better leader candidate than us - going back into recovery
{quote}	

(NEW) LEADER : 27
{quote}
2016-01-27 15:21:53.211 INFO  (zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread)
[   ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@478ca953
name:ZooKeeperConnection Watcher:solrNode013:2181,solrNode014:2181,solrNode015:2181 got event
WatchedEvent state:Disconnected type:None path:null path:null type:None
2016-01-27 15:21:53.211 INFO  (zkCallback-3-thread-73-processing-n:solrNode027:8983_solr-EventThread)
[   ] o.a.s.c.c.ConnectionManager zkClient has disconnected	
{quote}

All this time, replica 51 cannot catch up and stays in the recovering state.
The old leader 33 also stays in full recovery, since 27 is now the leader. For several minutes, node 33 tries to download all the segments but fails every checksum check ("expected
length is 137136 and actual length is 2096582", for example).
And because it cannot keep its ZooKeeper session, node 27 stays down as well, so there is no leader left
for the shard...
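
For anyone trying to reproduce this, a quick way to see which replica the published cluster state currently reports as the leader of shard3 (a minimal SolrJ sketch, not part of the original scenario; the ensemble and collection names are taken from the logs above):
{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.ZkStateReader;

public class ShardLeaderCheck {
    public static void main(String[] args) throws Exception {
        // Same ZooKeeper ensemble as in the logs above.
        CloudSolrServer client =
                new CloudSolrServer("solrNode013:2181,solrNode014:2181,solrNode015:2181");
        client.connect();
        ZkStateReader reader = client.getZkStateReader();

        // Ask the published cluster state which replica (if any) is currently
        // registered as the leader of shard3; null means no leader at the moment.
        Replica leader = reader.getClusterState().getLeader("offers_full", "shard3");
        System.out.println("shard3 leader: "
                + (leader == null ? "none" : leader.getStr(ZkStateReader.NODE_NAME_PROP)));

        client.shutdown();
    }
}
{code}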

This behaviour happens every time I send too many QPS; under 40 qps, we do not have any
problems.

Does it look similar to these ChaosMonkey failures?

> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
>                 Key: SOLR-8129
>                 URL: https://issues.apache.org/jira/browse/SOLR-8129
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>         Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures, including shard inconsistency.





