asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Till Westmann" <ti...@apache.org>
Subject Re: Requesting node X to shutdown to ensure failure?
Date Sun, 30 Sep 2018 19:40:39 GMT
Hi Taewoo,

I’m little confused as we seem to have 2 cc logs.

The "Requesting node 2 to shutdown …" message is logged, when nodes 
are considered to be dead as they have missed too many heartbeats. And 
you can see in the later logs that indeed more than 80s passed without 
receiving a heartbeat.
Could it be that the machines were too busy to send/receive heartbeats?

Cheers,
Till

On 30 Sep 2018, at 10:52, Taewoo Kim wrote:

> Hi devs,
>
> I saw the following log messages in the cc.log of the Cloudberrry 
> cluster.
> Yesterday around 9:50 pm, a task cluster was created and executed. And
> around 9:51 pm, the cluster shows the following message. In the nc.log 
> of
> the node 2, there were no noticeable log messages. I wonder what could
> cause this situation.
>
> -- cc.log at 9:51 pm
> *21:51:54.203 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 
> to
> shutdown to ensure failure*
>
> *-- nc-2.log around 9:50 pm*
> 21:50:30.314 [Worker:2] INFO
>  org.apache.hyracks.control.nc.work.StartTasksWork - input: 0: CDID:2
> 21:50:30.375 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:3:0
> 21:50:30.382 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:3:0
> 21:50:30.390 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:2:0
> 21:50:30.391 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:2:0
> 21:50:30.392 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
> 21:50:30.392 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
> 21:50:30.398 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
> 21:50:30.398 [Worker:2] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
> 22:02:30.290 [JVM exit thread] INFO  org.apache.hyracks.util.ExitUtil 
> - JVM
> exiting with status 0; bye!
>
>
> -- *cc.log from 9:51 pm*
> 21:50:30.313 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Built 
> 1
> Task Clusters
> 21:50:30.313 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - 
> Tasks:
> [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
> TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
> TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
> TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
> TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
> TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
> TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
> TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
> TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
> TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]
> 21:50:30.313 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC 
> roots:
> [TC:[TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
> TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
> TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
> TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
> TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
> TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
> TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
> TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
> TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
> TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]],
> inProgressTaskClusters: []
> 21:50:30.366 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:5:0]
> 21:50:30.374 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:9:0]
> 21:50:30.376 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:3:0]
> 21:50:30.376 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:8:0]
> 21:50:30.377 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:0:0]
> 21:50:30.378 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:4:0]
> 21:50:30.381 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:1:0]
> 21:50:30.382 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:8:0]
> 21:50:30.383 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:3:0]
> 21:50:30.384 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:4:0]
> 21:50:30.386 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:5:0]
> 21:50:30.386 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:9:0]
> 21:50:30.389 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:6:0]
> 21:50:30.389 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:0:0]
> 21:50:30.392 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:2:0]
> 21:50:30.392 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:7:0]
> 21:50:30.392 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:6:0]
> 21:50:30.394 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:1:0]
> 21:50:30.395 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:2:0]
> 21:50:30.395 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:7:0]
> 21:50:30.396 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:1:0]
> 21:50:30.396 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:0:0]
> 21:50:30.396 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:0:0]
> 21:50:30.396 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:1:0]
> 21:50:30.397 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:8:0]
> 21:50:30.398 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:4:0]
> 21:50:30.398 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:5:0]
> 21:50:30.398 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:4:0]
> 21:50:30.398 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:5:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:6:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:3:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:2:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:9:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:8:0]
> 21:50:30.399 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:9:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:7:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:6:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:7:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
> 21:50:30.402 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
> 21:50:30.404 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:3:0]
> 21:50:30.404 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.common.work.WorkQueue - Executing:
> TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:2:0]
> *21:51:54.203 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 
> to
> shutdown to ensure failure*
> 21:51:54.206 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to 
> shutdown
> failed node 2 succeeded. false positive heartbeat miss indication
> 21:51:54.206 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - 2 considered 
> dead.
> Last heartbeat received 83799ms ago. Max miss period: 80000ms
> 21:51:54.206 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 5 
> to
> shutdown to ensure failure
> 21:51:54.206 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to 
> shutdown
> failed node 5 succeeded. false positive heartbeat miss indication
> 21:51:54.206 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - 5 considered 
> dead.
> Last heartbeat received 83806ms ago. Max miss period: 80000ms
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 3 
> to
> shutdown to ensure failure
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to 
> shutdown
> failed node 3 succeeded. false positive heartbeat miss indication
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - 3 considered 
> dead.
> Last heartbeat received 83804ms ago. Max miss period: 80000ms
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 4 
> to
> shutdown to ensure failure
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - Request to 
> shutdown
> failed node 4 succeeded. false positive heartbeat miss indication
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.cluster.NodeManager - 4 considered 
> dead.
> Last heartbeat received 83804ms ago. Max miss period: 80000ms
> 21:51:54.207 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.work.RemoveDeadNodesWork - Number of
> affected jobs: 3
> 21:51:54.216 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for 
> job:
> JID:0.1: {1=[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
> TAID:TID:ANID:ODID:3:0:1:0]}
> 21:51:54.216 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Aborting:
> [TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
> TAID:TID:ANID:ODID:3:0:1:0] at 1
> 21:51:54.217 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - 
> Removing
> uncommitted partitions: []
> 21:51:54.217 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - 
> Removing
> partition requests: []
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed 
> task
> clusters
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed 
> task
> clusters found = 0
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC 
> roots:
> [TC:[TID:ANID:ODID:0:0:0, TID:ANID:ODID:3:0:0, TID:ANID:ODID:3:0:1,
> TID:ANID:ODID:3:0:2, TID:ANID:ODID:3:0:3, TID:ANID:ODID:3:0:4,
> TID:ANID:ODID:3:0:5, TID:ANID:ODID:3:0:6, TID:ANID:ODID:3:0:7,
> TID:ANID:ODID:3:0:8, TID:ANID:ODID:3:0:9]], inProgressTaskClusters: []
> 21:51:54.218 [Worker:ClusterController] ERROR
> org.apache.hyracks.control.cc.executor.JobExecutor - Unexpected 
> failure.
> Aborting job JID:0.1
> org.apache.hyracks.api.exceptions.HyracksException: Node 2 not live
>         at
> org.apache.hyracks.control.cc.executor.JobExecutor.assignLocation(JobExecutor.java:473)
> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.cc.executor.JobExecutor.assignTaskLocations(JobExecutor.java:365)
> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableTaskClusters(JobExecutor.java:245)
> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableActivityClusters(JobExecutor.java:209)
> ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.cc.executor.JobExecutor.notifyNodeFailures(JobExecutor.java:731)
> [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.cc.work.RemoveDeadNodesWork.run(RemoveDeadNodesWork.java:60)
> [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>         at
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> [hyracks-control-common-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for 
> job:
> JID:0.700: {}
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - 
> Removing
> uncommitted partitions: []
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - 
> Removing
> partition requests: []
> 21:51:54.218 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed 
> task
> clusters
> 21:51:54.219 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed 
> task
> clusters found = 0
> 21:51:54.219 [Worker:ClusterController] INFO
>  org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for 
> job:
> JID:0.700: {1=[TAID:TID:ANID:ODID:2:0:0:0, 
> TAID:TID:ANID:ODID:2:0:1:0]}
>
> Best,
> Taewoo

Mime
View raw message