helix-commits mailing list archives

From "Craig Murphey (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (HELIX-822) OnlineOffline cluster stops rebalancing
Date Fri, 20 Mar 2020 16:14:00 GMT

     [ https://issues.apache.org/jira/browse/HELIX-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Murphey resolved HELIX-822.
---------------------------------
    Resolution: Fixed

We upgraded to the latest release and no longer see the issue.

> OnlineOffline cluster stops rebalancing
> ---------------------------------------
>
>                 Key: HELIX-822
>                 URL: https://issues.apache.org/jira/browse/HELIX-822
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.8.x
>            Reporter: Craig Murphey
>            Priority: Major
>         Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png
>
>
> We recently upgraded our controller to 0.8.4, then downgraded it back to 0.8.2. Since then, some time after a controller is elected master, we've seen our LiveInstanceChangeListener not get called for a live instance update.
> On the controller, a thread that is started at controller startup continuously logs the external state, and it does see the instance count decrease.
> At the time the listener notification is expected, we do see a large number of ZooKeeper nodes being created and deleted.
> !Screen Shot 2020-03-05 at 11.28.53 AM.png!
> Upon inspection of our instances with helix-admin.sh, we found that we have many more instances than live instances (20 live instances, 60-100 instances). This is because we register each participant by hostname, which can change over time.
> Looking into these instances, we found that many of the non-live instances have many messages left over.
> We are able to mitigate the issue by restarting the master controller manually.
> How do leftover instances affect overall cluster health? Is it possible that the controller is trying to tell offline instances that their resource is dropped, and that this is preventing the controller from issuing the live instance change event?
> Here's a snapshot of what we saw in zk:
>  
> {noformat}
> So, in DCA, there are a lot of messages in ZooKeeper for instances that are not live ->
> $ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1016-dca1_8274 : 0
> agent1053-dca1_8274 : 0
> agent1100-dca1_8274 : 0
> agent1346-dca1_8274 : 0
> agent1397-dca1_8274 : 0
> agent1406-dca1_8274 : 0
> agent1412-dca1_8274 : 0
> agent1549-dca1_8274 : 0
> agent1558-dca1_8274 : 0
> agent1573-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent211-dca1_8274 : 0
> agent2124-dca1_8274 : 0
> agent2148-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent2153-dca1_8274 : 0
> agent2184-dca1_8274 : 0
> agent21-dca1_8274 : 0
> agent2287-dca1_8274 : 0
> agent2713-dca1_8274 : 0
> agent2763-dca1_8274 : 0
> agent27-dca1_8274 : 0
> agent2878-dca1_8274 : 0
> agent2900-dca1_8274 : 0
> agent2930-dca1_8274 : 0
> agent31-dca1_8274 : 0
> agent3372-dca1_8274 : 0
> agent3376-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3436-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3543-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3601-dca1_8274 : 0
> agent3646-dca1_8274 : 0
> agent3647-dca1_8274 : 0
> agent3648-dca1_8274 : 0
> agent3651-dca1_8274 : 0
> agent3671-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent3678-dca1_8274 : 0
> agent3699-dca1_8274 : 0
> agent3714-dca1_8274 : 0
> agent3726-dca1_8274 : 0
> agent3991-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent4121-dca1_8274 : 0
> agent4545-dca1_8274 : 0
> agent4581-dca1_8274 : 0
> agent4601-dca1_8274 : 0
> agent4612-dca1_8274 : 0
> agent4649-dca1_8274 : 0
> agent4650-dca1_8274 : 0
> agent4651-dca1_8274 : 0
> agent4664-dca1_8274 : 0
> agent4672-dca1_8274 : 0
> agent4678-dca1_8274 : 0
> agent46-dca1_8274 : 0
> agent4702-dca1_8274 : 0
> agent4722-dca1_8274 : 0
> agent4726-dca1_8274 : 0
> agent4729-dca1_8274 : 0
> agent4730-dca1_8274 : 0
> agent5233-dca1_8274 : 0
> agent5261-dca1_8274 : 0
> agent5284-dca1_8274 : 0
> agent63-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> agent79-dca1_8274 : 0
> agent83-dca1_8274 : 0
> agent84-dca1_8274 : 0
> agent90-dca1_8274 : 0
> appdocker1204-dca1_8274 : 0
> appdocker1454-dca1_8274 : 0
> appdocker1858-dca1_8274 : 0
> appdocker1950-dca1_8274 : 0
> appdocker1966-dca1_8274 : 0
> appdocker1970-dca1_8274 : 0
> appdocker1985-dca1_8274 : 0
> appdocker2012-dca1_8274 : 0
> appdocker2046-dca1_8274 : 0
> appdocker255-dca1_8274 : 0
> appdocker30-dca1_8274 : 0
> appdocker507-dca1_8274 : 0
> appdocker568-dca1_8274 : 0
> appdocker580-dca1_8274 : 0
> appdocker61-dca1_8274 : 0
> appdocker661-dca1_8274 : 0
> appdocker693-dca1_8274 : 0
> appdocker77-dca1_8274 : 0
> appdocker791-dca1_8274 : 0
> appdocker874-dca1_8274 : 0
> appdocker909-dca1_8274 : 0
> appdocker949-dca1_8274 : 0
> compute1699-dca1_8274 : 0
> compute2072-dca1_8274 : 0
> compute228-dca1_8274 : 0
> compute2527-dca1_8274 : 0
> compute2541-dca1_8274 : 0
> compute2579-dca1_8274 : 0
> compute2608-dca1_8274 : 0
> compute2792-dca1_8274 : 0
> compute2822-dca1_8274 : 0
> compute2842-dca1_8274 : 0
> compute2849-dca1_8274 : 0
> compute2862-dca1_8274 : 0
> compute2928-dca1_8274 : 0
> compute2937-dca1_8274 : 0
> compute2946-dca1_8274 : 0
> compute295-dca1_8274 : 0
> compute2964-dca1_8274 : 0
> compute2999-dca1_8274 : 0
> compute3026-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3209-dca1_8274 : 0
> compute3217-dca1_8274 : 0
> compute3244-dca1_8274 : 0
> compute3247-dca1_8274 : 0
> compute3363-dca1_8274 : 0
> compute3373-dca1_8274 : 0
> compute3383-dca1_8274 : 0
> compute3385-dca1_8274 : 0
> compute3391-dca1_8274 : 0
> compute3413-dca1_8274 : 0
> compute3449-dca1_8274 : 0
> compute3452-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3526-dca1_8274 : 0
> compute3530-dca1_8274 : 0
> compute3546-dca1_8274 : 0
> compute3571-dca1_8274 : 0
> compute3584-dca1_8274 : 0
> compute3600-dca1_8274 : 0
> compute3621-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute3691-dca1_8274 : 0
> compute3695-dca1_8274 : 0
> compute36-dca1_8274 : 0
> compute3750-dca1_8274 : 0
> compute3770-dca1_8274 : 0
> compute3809-dca1_8274 : 0
> compute3846-dca1_8274 : 0
> compute3857-dca1_8274 : 0
> compute3919-dca1_8274 : 0
> compute3985-dca1_8274 : 0
> compute4033-dca1_8274 : 0
> compute4036-dca1_8274 : 0
> compute4103-dca1_8274 : 0
> compute4141-dca1_8274 : 0
> compute4161-dca1_8274 : 0
> compute4191-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute42-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4339-dca1_8274 : 0
> compute4396-dca1_8274 : 0
> compute4474-dca1_8274 : 0
> compute4502-dca1_8274 : 0
> compute4532-dca1_8274 : 0
> compute4548-dca1_8274 : 0
> compute4716-dca1_8274 : 0
> compute4764-dca1_8274 : 0
> compute4817-dca1_8274 : 0
> compute4873-dca1_8274 : 0
> compute4887-dca1_8274 : 0
> compute4900-dca1_8274 : 0
> compute4924-dca1_8274 : 0
> compute4962-dca1_8274 : 0
> compute4966-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute4994-dca1_8274 : 0
> compute4998-dca1_8274 : 0
> compute5303-dca1_8274 : 0
> compute5338-dca1_8274 : 0
> compute5659-dca1_8274 : 0
> compute5661-dca1_8274 : 0
> compute5675-dca1_8274 : 0
> compute5698-dca1_8274 : 0
> compute5710-dca1_8274 : 0
> compute5933-dca1_8274 : 0
> compute5978-dca1_8274 : 0
> compute6011-dca1_8274 : 0
> compute6034-dca1_8274 : 0
> compute6089-dca1_8274 : 0
> compute6269-dca1_8274 : 0
> compute6339-dca1_8274 : 0
> compute6358-dca1_8274 : 0
> compute6366-dca1_8274 : 0
> compute6432-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6717-dca1_8274 : 0
> compute6767-dca1_8274 : 0
> compute6791-dca1_8274 : 0
> compute6825-dca1_8274 : 0
> compute6892-dca1_8274 : 0
> compute68-dca1_8274 : 0
> compute6905-dca1_8274 : 0
> compute6937-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> compute6994-dca1_8274 : 0
> compute7029-dca1_8274 : 0
> compute7179-dca1_8274 : 0
> compute73-dca1_8274 : 0
> compute7582-dca1_8274 : 0
> compute7586-dca1_8274 : 0
> compute7601-dca1_8274 : 0
> compute7614-dca1_8274 : 0
> compute7700-dca1_8274 : 0
> compute7832-dca1_8274 : 0
> compute7837-dca1_8274 : 0
> compute8696-dca1_8274 : 0
> compute8697-dca1_8274 : 0
> compute8786-dca1_8274 : 0
> compute8864-dca1_8274 : 0
> compute8868-dca1_8274 : 0
> mpdocker01-dca1_8274 : 0
> mpdocker02-dca1_8274 : 0
> mpdocker03-dca1_8274 : 0
> mpdocker04-dca1_8274 : 0
> mpdocker05-dca1_8274 : 0
> mpdocker06-dca1_8274 : 0
> mpdocker07-dca1_8274 : 0
> mpdocker08-dca1_8274 : 0
> mpdocker09-dca1_8274 : 0
> agent1601-dca1_8274 : 2
> agent201-dca1_8274 : 2
> agent1415-dca1_8274 : 3
> agent4605-dca1_8274 : 3
> agent5212-dca1_8274 : 3
> agent5236-dca1_8274 : 3
> agent5242-dca1_8274 : 3
> compute4763-dca1_8274 : 3
> compute4916-dca1_8274 : 3
> compute6933-dca1_8274 : 3
> compute6984-dca1_8274 : 3
> compute7713-dca1_8274 : 3
> agent2213-dca1_8274 : 5
> agent3394-dca1_8274 : 5
> agent3618-dca1_8274 : 5
> agent4574-dca1_8274 : 5
> agent4677-dca1_8274 : 5
> agent47-dca1_8274 : 5
> compute2824-dca1_8274 : 5
> compute3640-dca1_8274 : 5
> compute3861-dca1_8274 : 5
> compute7159-dca1_8274 : 5
> compute7600-dca1_8274 : 5
> compute7839-dca1_8274 : 5
> compute2985-dca1_8274 : 6
> compute3615-dca1_8274 : 6
> compute4692-dca1_8274 : 6
> agent2209-dca1_8274 : 8
> agent2214-dca1_8274 : 8
> compute3710-dca1_8274 : 8
> compute6329-dca1_8274 : 8
> agent5265-dca1_8274 : 9
> compute7746-dca1_8274 : 13
> agent5179-dca1_8274 : 14
> agent4548-dca1_8274 : 15
> agent3611-dca1_8274 : 20
> agent3721-dca1_8274 : 23
> compute3764-dca1_8274 : 23
> agent3989-dca1_8274 : 30
> agent4145-dca1_8274 : 51
> compute3781-dca1_8274 : 55
> agent2168-dca1_8274 : 60
> agent5352-dca1_8274 : 68
> agent3533-dca1_8274 : 78
> compute4857-dca1_8274 : 78
> compute2982-dca1_8274 : 110
> agent4552-dca1_8274 : 113
> appdocker1082-dca1_8274 : 135
> appdocker538-dca1_8274 : 137
> compute1620-dca1_8274 : 512
> None of the LIVEINSTANCES have any messages ->
> $ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1412-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> {noformat}
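
The two listings above can be cross-checked mechanically. What follows is only a minimal sketch, not part of the original report: it assumes both listings have been saved as text in the "name : count" form shown, and the helper names parse_counts and stale_with_messages are hypothetical. It reports the instances that are absent from LIVEINSTANCES but still hold messages, largest backlog first.

```python
from typing import Dict, List, Tuple


def parse_counts(listing: str) -> Dict[str, int]:
    """Parse 'name : count' lines; lines that do not end in a number are skipped."""
    counts: Dict[str, int] = {}
    for raw in listing.splitlines():
        name, sep, count = raw.strip().rpartition(" : ")
        if sep and count.strip().isdigit():
            # Drop an email quote prefix ("> ") if the text was pasted from the archive.
            counts[name.lstrip("> ").strip()] = int(count)
    return counts


def stale_with_messages(all_listing: str, live_listing: str) -> List[Tuple[str, int]]:
    """Return (instance, message_count) for non-live instances with pending messages."""
    all_counts = parse_counts(all_listing)
    live = set(parse_counts(live_listing))
    stale = [(n, c) for n, c in all_counts.items() if n not in live and c > 0]
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # A few rows copied from the listings above.
    all_listing = """
    agent1412-dca1_8274 : 0
    agent1601-dca1_8274 : 2
    appdocker538-dca1_8274 : 137
    compute1620-dca1_8274 : 512
    """
    live_listing = "agent1412-dca1_8274 : 0"
    for name, count in stale_with_messages(all_listing, live_listing):
        print(name, count)
```

Run against the full listings, this would separate the instances that only linger under /DLM/INSTANCES from those that also carry undelivered messages, which are the ones the question above is about.
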
>  
> Current Version: 0.8.2
> StateModel: OnlineOffline
> {code:java}
> ./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
> StateModelDefinition: { "id" : "OnlineOffline",
>   "mapFields" : { "DROPPED.meta" : { "count" : "-1" }, "OFFLINE.meta" : { "count" : "-1" },
>     "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" }, "ONLINE.meta" : { "count" : "R" },
>     "ONLINE.next" : { "DROPPED" : "OFFLINE", "OFFLINE" : "OFFLINE" } },
>   "listFields" : { "STATE_PRIORITY_LIST" : [ "ONLINE", "OFFLINE", "DROPPED" ],
>     "STATE_TRANSITION_PRIORITYLIST" : [ "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ] },
>   "simpleFields" : { "INITIAL_STATE" : "OFFLINE" } }
> {code}
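
For readers unfamiliar with the definition above, the "<STATE>.next" maps can be read as next-hop tables: for a given current state and target state, the entry names the state to move to next. That reading is an assumption here, not something the report states. Under it, the sketch below walks the tables to show that a replica in ONLINE reaches DROPPED by way of OFFLINE. The function name transition_path is hypothetical, and the dictionary contains only the two "next" maps copied from the definition.

```python
from typing import Dict, List

# The two ".next" maps copied from the quoted StateModelDefinition.
STATE_MODEL: Dict[str, Dict[str, Dict[str, str]]] = {
    "mapFields": {
        "OFFLINE.next": {"DROPPED": "DROPPED", "ONLINE": "ONLINE"},
        "ONLINE.next": {"DROPPED": "OFFLINE", "OFFLINE": "OFFLINE"},
    }
}


def transition_path(definition: dict, start: str, target: str) -> List[str]:
    """Follow the next-hop tables from start until target is reached."""
    path = [start]
    current = start
    while current != target:
        next_map = definition["mapFields"].get(current + ".next", {})
        if target not in next_map:
            raise ValueError("no transition from %s toward %s" % (current, target))
        current = next_map[target]
        if current in path:
            # Guard against a malformed definition that loops forever.
            raise ValueError("cycle detected at %s" % current)
        path.append(current)
    return path


if __name__ == "__main__":
    print(transition_path(STATE_MODEL, "ONLINE", "DROPPED"))
```

If that reading holds, dropping a resource from an ONLINE replica takes two transitions (ONLINE to OFFLINE, then OFFLINE to DROPPED), so each affected instance would be sent more than one message.
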
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
