helix-commits mailing list archives

From "Craig Murphey (Jira)" <j...@apache.org>
Subject [jira] [Created] (HELIX-822) OnlineOffline cluster stops rebalancing
Date Thu, 05 Mar 2020 19:40:00 GMT
Craig Murphey created HELIX-822:
-----------------------------------

             Summary: OnlineOffline cluster stops rebalancing
                 Key: HELIX-822
                 URL: https://issues.apache.org/jira/browse/HELIX-822
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
    Affects Versions: 0.8.x
            Reporter: Craig Murphey
         Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png

We recently upgraded our controller to 0.8.4, then downgraded it back to 0.8.2. Since then,
some time after a controller is elected master, we have seen our LiveInstanceChangeListener
not get called for a live instance update.

On the controller, a thread that is started at controller startup continuously logs the
external state, and it does see the instance count decrease.

At the time the listener notification was expected, we do see a large number of znodes
being created and deleted.

!Screen Shot 2020-03-05 at 11.28.53 AM.png!

Upon inspecting our instances with helix-admin.sh, we found that we have many more instances
than live instances (20 live instances, 60-100 instances). This is because we register each
participant by hostname, which can change over time.

Looking into these instances, we found that many of the non-live instances have many
leftover messages.

We are able to mitigate the issue by restarting the master controller manually.

How do leftover instances affect overall cluster health? Is it possible that the controller
is trying to tell offline instances that their resource is dropped, and that this is preventing
the controller from issuing the live instance change event?

Here's a snapshot of what we saw in zk:

 
{noformat}
So, in DCA, there are a lot of messages in Zookeeper for instances that are not live ->

$ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' | while read host; do
    echo -n "$host : "
    zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l
  done | sort -nk 3
agent1016-dca1_8274 : 0
agent1053-dca1_8274 : 0
agent1100-dca1_8274 : 0
agent1346-dca1_8274 : 0
agent1397-dca1_8274 : 0
agent1406-dca1_8274 : 0
agent1412-dca1_8274 : 0
agent1549-dca1_8274 : 0
agent1558-dca1_8274 : 0
agent1573-dca1_8274 : 0
agent1584-dca1_8274 : 0
agent211-dca1_8274 : 0
agent2124-dca1_8274 : 0
agent2148-dca1_8274 : 0
agent2149-dca1_8274 : 0
agent2153-dca1_8274 : 0
agent2184-dca1_8274 : 0
agent21-dca1_8274 : 0
agent2287-dca1_8274 : 0
agent2713-dca1_8274 : 0
agent2763-dca1_8274 : 0
agent27-dca1_8274 : 0
agent2878-dca1_8274 : 0
agent2900-dca1_8274 : 0
agent2930-dca1_8274 : 0
agent31-dca1_8274 : 0
agent3372-dca1_8274 : 0
agent3376-dca1_8274 : 0
agent3435-dca1_8274 : 0
agent3436-dca1_8274 : 0
agent3473-dca1_8274 : 0
agent3543-dca1_8274 : 0
agent3564-dca1_8274 : 0
agent3572-dca1_8274 : 0
agent3601-dca1_8274 : 0
agent3646-dca1_8274 : 0
agent3647-dca1_8274 : 0
agent3648-dca1_8274 : 0
agent3651-dca1_8274 : 0
agent3671-dca1_8274 : 0
agent3677-dca1_8274 : 0
agent3678-dca1_8274 : 0
agent3699-dca1_8274 : 0
agent3714-dca1_8274 : 0
agent3726-dca1_8274 : 0
agent3991-dca1_8274 : 0
agent4070-dca1_8274 : 0
agent4096-dca1_8274 : 0
agent4121-dca1_8274 : 0
agent4545-dca1_8274 : 0
agent4581-dca1_8274 : 0
agent4601-dca1_8274 : 0
agent4612-dca1_8274 : 0
agent4649-dca1_8274 : 0
agent4650-dca1_8274 : 0
agent4651-dca1_8274 : 0
agent4664-dca1_8274 : 0
agent4672-dca1_8274 : 0
agent4678-dca1_8274 : 0
agent46-dca1_8274 : 0
agent4702-dca1_8274 : 0
agent4722-dca1_8274 : 0
agent4726-dca1_8274 : 0
agent4729-dca1_8274 : 0
agent4730-dca1_8274 : 0
agent5233-dca1_8274 : 0
agent5261-dca1_8274 : 0
agent5284-dca1_8274 : 0
agent63-dca1_8274 : 0
agent6444-dca1_8274 : 0
agent79-dca1_8274 : 0
agent83-dca1_8274 : 0
agent84-dca1_8274 : 0
agent90-dca1_8274 : 0
appdocker1204-dca1_8274 : 0
appdocker1454-dca1_8274 : 0
appdocker1858-dca1_8274 : 0
appdocker1950-dca1_8274 : 0
appdocker1966-dca1_8274 : 0
appdocker1970-dca1_8274 : 0
appdocker1985-dca1_8274 : 0
appdocker2012-dca1_8274 : 0
appdocker2046-dca1_8274 : 0
appdocker255-dca1_8274 : 0
appdocker30-dca1_8274 : 0
appdocker507-dca1_8274 : 0
appdocker568-dca1_8274 : 0
appdocker580-dca1_8274 : 0
appdocker61-dca1_8274 : 0
appdocker661-dca1_8274 : 0
appdocker693-dca1_8274 : 0
appdocker77-dca1_8274 : 0
appdocker791-dca1_8274 : 0
appdocker874-dca1_8274 : 0
appdocker909-dca1_8274 : 0
appdocker949-dca1_8274 : 0
compute1699-dca1_8274 : 0
compute2072-dca1_8274 : 0
compute228-dca1_8274 : 0
compute2527-dca1_8274 : 0
compute2541-dca1_8274 : 0
compute2579-dca1_8274 : 0
compute2608-dca1_8274 : 0
compute2792-dca1_8274 : 0
compute2822-dca1_8274 : 0
compute2842-dca1_8274 : 0
compute2849-dca1_8274 : 0
compute2862-dca1_8274 : 0
compute2928-dca1_8274 : 0
compute2937-dca1_8274 : 0
compute2946-dca1_8274 : 0
compute295-dca1_8274 : 0
compute2964-dca1_8274 : 0
compute2999-dca1_8274 : 0
compute3026-dca1_8274 : 0
compute3045-dca1_8274 : 0
compute3209-dca1_8274 : 0
compute3217-dca1_8274 : 0
compute3244-dca1_8274 : 0
compute3247-dca1_8274 : 0
compute3363-dca1_8274 : 0
compute3373-dca1_8274 : 0
compute3383-dca1_8274 : 0
compute3385-dca1_8274 : 0
compute3391-dca1_8274 : 0
compute3413-dca1_8274 : 0
compute3449-dca1_8274 : 0
compute3452-dca1_8274 : 0
compute3525-dca1_8274 : 0
compute3526-dca1_8274 : 0
compute3530-dca1_8274 : 0
compute3546-dca1_8274 : 0
compute3571-dca1_8274 : 0
compute3584-dca1_8274 : 0
compute3600-dca1_8274 : 0
compute3621-dca1_8274 : 0
compute3678-dca1_8274 : 0
compute3691-dca1_8274 : 0
compute3695-dca1_8274 : 0
compute36-dca1_8274 : 0
compute3750-dca1_8274 : 0
compute3770-dca1_8274 : 0
compute3809-dca1_8274 : 0
compute3846-dca1_8274 : 0
compute3857-dca1_8274 : 0
compute3919-dca1_8274 : 0
compute3985-dca1_8274 : 0
compute4033-dca1_8274 : 0
compute4036-dca1_8274 : 0
compute4103-dca1_8274 : 0
compute4141-dca1_8274 : 0
compute4161-dca1_8274 : 0
compute4191-dca1_8274 : 0
compute4239-dca1_8274 : 0
compute42-dca1_8274 : 0
compute4305-dca1_8274 : 0
compute4339-dca1_8274 : 0
compute4396-dca1_8274 : 0
compute4474-dca1_8274 : 0
compute4502-dca1_8274 : 0
compute4532-dca1_8274 : 0
compute4548-dca1_8274 : 0
compute4716-dca1_8274 : 0
compute4764-dca1_8274 : 0
compute4817-dca1_8274 : 0
compute4873-dca1_8274 : 0
compute4887-dca1_8274 : 0
compute4900-dca1_8274 : 0
compute4924-dca1_8274 : 0
compute4962-dca1_8274 : 0
compute4966-dca1_8274 : 0
compute4967-dca1_8274 : 0
compute4980-dca1_8274 : 0
compute4994-dca1_8274 : 0
compute4998-dca1_8274 : 0
compute5303-dca1_8274 : 0
compute5338-dca1_8274 : 0
compute5659-dca1_8274 : 0
compute5661-dca1_8274 : 0
compute5675-dca1_8274 : 0
compute5698-dca1_8274 : 0
compute5710-dca1_8274 : 0
compute5933-dca1_8274 : 0
compute5978-dca1_8274 : 0
compute6011-dca1_8274 : 0
compute6034-dca1_8274 : 0
compute6089-dca1_8274 : 0
compute6269-dca1_8274 : 0
compute6339-dca1_8274 : 0
compute6358-dca1_8274 : 0
compute6366-dca1_8274 : 0
compute6432-dca1_8274 : 0
compute6716-dca1_8274 : 0
compute6717-dca1_8274 : 0
compute6767-dca1_8274 : 0
compute6791-dca1_8274 : 0
compute6825-dca1_8274 : 0
compute6892-dca1_8274 : 0
compute68-dca1_8274 : 0
compute6905-dca1_8274 : 0
compute6937-dca1_8274 : 0
compute6992-dca1_8274 : 0
compute6994-dca1_8274 : 0
compute7029-dca1_8274 : 0
compute7179-dca1_8274 : 0
compute73-dca1_8274 : 0
compute7582-dca1_8274 : 0
compute7586-dca1_8274 : 0
compute7601-dca1_8274 : 0
compute7614-dca1_8274 : 0
compute7700-dca1_8274 : 0
compute7832-dca1_8274 : 0
compute7837-dca1_8274 : 0
compute8696-dca1_8274 : 0
compute8697-dca1_8274 : 0
compute8786-dca1_8274 : 0
compute8864-dca1_8274 : 0
compute8868-dca1_8274 : 0
mpdocker01-dca1_8274 : 0
mpdocker02-dca1_8274 : 0
mpdocker03-dca1_8274 : 0
mpdocker04-dca1_8274 : 0
mpdocker05-dca1_8274 : 0
mpdocker06-dca1_8274 : 0
mpdocker07-dca1_8274 : 0
mpdocker08-dca1_8274 : 0
mpdocker09-dca1_8274 : 0
agent1601-dca1_8274 : 2
agent201-dca1_8274 : 2
agent1415-dca1_8274 : 3
agent4605-dca1_8274 : 3
agent5212-dca1_8274 : 3
agent5236-dca1_8274 : 3
agent5242-dca1_8274 : 3
compute4763-dca1_8274 : 3
compute4916-dca1_8274 : 3
compute6933-dca1_8274 : 3
compute6984-dca1_8274 : 3
compute7713-dca1_8274 : 3
agent2213-dca1_8274 : 5
agent3394-dca1_8274 : 5
agent3618-dca1_8274 : 5
agent4574-dca1_8274 : 5
agent4677-dca1_8274 : 5
agent47-dca1_8274 : 5
compute2824-dca1_8274 : 5
compute3640-dca1_8274 : 5
compute3861-dca1_8274 : 5
compute7159-dca1_8274 : 5
compute7600-dca1_8274 : 5
compute7839-dca1_8274 : 5
compute2985-dca1_8274 : 6
compute3615-dca1_8274 : 6
compute4692-dca1_8274 : 6
agent2209-dca1_8274 : 8
agent2214-dca1_8274 : 8
compute3710-dca1_8274 : 8
compute6329-dca1_8274 : 8
agent5265-dca1_8274 : 9
compute7746-dca1_8274 : 13
agent5179-dca1_8274 : 14
agent4548-dca1_8274 : 15
agent3611-dca1_8274 : 20
agent3721-dca1_8274 : 23
compute3764-dca1_8274 : 23
agent3989-dca1_8274 : 30
agent4145-dca1_8274 : 51
compute3781-dca1_8274 : 55
agent2168-dca1_8274 : 60
agent5352-dca1_8274 : 68
agent3533-dca1_8274 : 78
compute4857-dca1_8274 : 78
compute2982-dca1_8274 : 110
agent4552-dca1_8274 : 113
appdocker1082-dca1_8274 : 135
appdocker538-dca1_8274 : 137
compute1620-dca1_8274 : 512
None of the LIVEINSTANCES have any messages ->
$ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' | while read host; do
    echo -n "$host : "
    zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l
  done | sort -nk 3
agent1412-dca1_8274 : 0
agent1584-dca1_8274 : 0
agent2149-dca1_8274 : 0
agent3435-dca1_8274 : 0
agent3473-dca1_8274 : 0
agent3564-dca1_8274 : 0
agent3572-dca1_8274 : 0
agent3677-dca1_8274 : 0
agent4070-dca1_8274 : 0
agent4096-dca1_8274 : 0
agent6444-dca1_8274 : 0
compute3045-dca1_8274 : 0
compute3525-dca1_8274 : 0
compute3678-dca1_8274 : 0
compute4239-dca1_8274 : 0
compute4305-dca1_8274 : 0
compute4967-dca1_8274 : 0
compute4980-dca1_8274 : 0
compute6716-dca1_8274 : 0
compute6992-dca1_8274 : 0
{noformat}
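To make the comparison between the two listings easier to repeat, here is a minimal sketch that takes the `host : count` output of both pipelines and reports the instances that are registered but not live and still hold messages. The function names `parse_counts` and `stale_backlog` are hypothetical helpers for illustration, and the sample input is only a four-line excerpt of the listings above, not the full data.

```python
def parse_counts(listing: str) -> dict[str, int]:
    """Parse lines shaped like 'agent1601-dca1_8274 : 2' into {instance: count}."""
    counts = {}
    for line in listing.splitlines():
        name, sep, count = line.partition(" : ")
        # Skip anything that is not a well-formed 'name : number' line.
        if sep and count.strip().isdigit():
            counts[name.strip()] = int(count.strip())
    return counts


def stale_backlog(all_listing: str, live_listing: str) -> dict[str, int]:
    """Return message counts for instances that are registered, not live, and non-empty."""
    all_counts = parse_counts(all_listing)
    live = set(parse_counts(live_listing))
    return {name: n for name, n in all_counts.items() if name not in live and n > 0}


# Excerpt of the /DLM/INSTANCES listing above (sample input only).
ALL_INSTANCES = """\
agent1412-dca1_8274 : 0
agent1016-dca1_8274 : 0
agent1601-dca1_8274 : 2
compute1620-dca1_8274 : 512
"""

# Excerpt of the /DLM/LIVEINSTANCES listing above (sample input only).
LIVE_INSTANCES = """\
agent1412-dca1_8274 : 0
"""

if __name__ == "__main__":
    backlog = stale_backlog(ALL_INSTANCES, LIVE_INSTANCES)
    for name, n in sorted(backlog.items(), key=lambda kv: kv[1]):
        print(f"{name} : {n}")
    print("total stale messages:", sum(backlog.values()))
```

Run against the full output of both commands, this would list every non-live instance that still has pending messages, which are the candidates we are asking about.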
 

Current Version: 0.8.2

StateModel: OnlineOffline
{code:java}
./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
StateModelDefinition: {
  "id" : "OnlineOffline",
  "mapFields" : {
    "DROPPED.meta" : { "count" : "-1" },
    "OFFLINE.meta" : { "count" : "-1" },
    "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
    "ONLINE.meta" : { "count" : "R" },
    "ONLINE.next" : { "DROPPED" : "OFFLINE", "OFFLINE" : "OFFLINE" }
  },
  "listFields" : {
    "STATE_PRIORITY_LIST" : [ "ONLINE", "OFFLINE", "DROPPED" ],
    "STATE_TRANSITION_PRIORITYLIST" : [ "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ]
  },
  "simpleFields" : { "INITIAL_STATE" : "OFFLINE" }
}
{code}
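For context on the dropped-resource question, here is a small sketch of how the `*.next` maps in this definition can be read. It assumes, as we understand it, that each `<STATE>.next` entry maps a target state to the next hop toward that target. The `NEXT_STATE` table is copied from the output above; `transition_path` is a hypothetical helper, not part of Helix.

```python
# "<STATE>.next" maps copied from the state model definition above.
NEXT_STATE = {
    "OFFLINE": {"DROPPED": "DROPPED", "ONLINE": "ONLINE"},
    "ONLINE": {"DROPPED": "OFFLINE", "OFFLINE": "OFFLINE"},
}


def transition_path(start: str, target: str) -> list[str]:
    """Follow next-hop entries from start until target is reached."""
    path = [start]
    current = start
    while current != target:
        nxt = NEXT_STATE.get(current, {}).get(target)
        # Stop on a missing entry or a loop instead of spinning forever.
        if nxt is None or nxt in path:
            raise ValueError(f"no path from {start} to {target}")
        path.append(nxt)
        current = nxt
    return path


if __name__ == "__main__":
    print(" -> ".join(transition_path("ONLINE", "DROPPED")))
    print(" -> ".join(transition_path("OFFLINE", "ONLINE")))
```

Under that reading, dropping a replica that is ONLINE takes two transitions (ONLINE to OFFLINE, then OFFLINE to DROPPED), so a replica on a non-live instance could need more than one message to be fully dropped.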
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
