helix-commits mailing list archives

From j...@apache.org
Subject helix git commit: Adding monitoring metrics document. Link task framework related sections to the tutorial.
Date Wed, 25 Apr 2018 23:03:52 GMT
Repository: helix
Updated Branches:
  refs/heads/master 39629af65 -> bb1ba8962


Adding monitoring metrics document.
Link task framework related sections to the tutorial.


Project: http://git-wip-us.apache.org/repos/asf/helix/repo
Commit: http://git-wip-us.apache.org/repos/asf/helix/commit/bb1ba896
Tree: http://git-wip-us.apache.org/repos/asf/helix/tree/bb1ba896
Diff: http://git-wip-us.apache.org/repos/asf/helix/diff/bb1ba896

Branch: refs/heads/master
Commit: bb1ba8962b5033c09b0b2b9199a4d94ed11b8902
Parents: 39629af
Author: jiajunwang <ericwang1985@gmail.com>
Authored: Wed Apr 25 10:57:22 2018 -0700
Committer: jiajunwang <ericwang1985@gmail.com>
Committed: Wed Apr 25 14:18:42 2018 -0700

----------------------------------------------------------------------
 website/0.8.1/src/site/markdown/Metrics.md      | 217 +++++++++++++++++++
 website/0.8.1/src/site/markdown/index.md        |   3 +
 .../site/markdown/tutorial_task_framework.md    |  13 ++
 3 files changed, 233 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/helix/blob/bb1ba896/website/0.8.1/src/site/markdown/Metrics.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/Metrics.md b/website/0.8.1/src/site/markdown/Metrics.md
new file mode 100644
index 0000000..3a67b69
--- /dev/null
+++ b/website/0.8.1/src/site/markdown/Metrics.md
@@ -0,0 +1,217 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Helix Monitoring Metrics
+------------------
+
+Helix monitoring metrics are exposed as MBean attributes.
+The MBeans are registered based on instance role.
+
+The easiest way to see the available metrics is to point jconsole at a running Helix instance.
+This allows browsing all metrics over JMX.
+
+Note that unless the attribute name says otherwise, all attributes are gauges.
+
+### Metrics on Both Controller and Participant
+
+#### MBean ZkClientMonitor
+ObjectName: "HelixZkClient:type=[client-type],key=[specified-client-key],PATH=[zk-client-listening-path]"
+
+|Attributes|Description|
+|----------|-----------|
+|ReadCounter|Zk read counter. Can be used to identify unusually high or low Zk traffic.|
+|WriteCounter|Zk write counter. Same use as ReadCounter.|
+|ReadBytesCounter|Zk read bytes counter. Same use as ReadCounter.|
+|WriteBytesCounter|Zk write bytes counter. Same use as ReadCounter.|
+|StateChangeEventCounter|Zk connection state change counter. Can be used to identify an unstable ZkClient connection.|
+|DataChangeEventCounter|Zk node data change counter. Can be used to identify unusually high or low Zk event occurrence or slow event processing.|
+|PendingCallbackGauge|Number of the pending Zk callbacks.|
+|TotalCallbackCounter|Number of total received Zk callbacks.|
+|TotalCallbackHandledCounter|Number of total handled Zk callbacks.|
+|ReadTotalLatencyCounter|Total read latency in ms.|
+|WriteTotalLatencyCounter|Total write latency in ms.|
+|WriteFailureCounter|Total write failures.|
+|ReadFailureCounter|Total read failures.|
+|ReadLatencyGauge|Histogram (with all statistic data) of read latency.|
+|WriteLatencyGauge|Histogram (with all statistic data) of write latency.|
+|ReadBytesGauge|Histogram (with all statistic data) of read bytes of single Zk access.|
+|WriteBytesGauge|Histogram (with all statistic data) of write bytes of single Zk access.|
+
+#### MBean HelixCallbackMonitor
+ObjectName: "HelixCallback:Type=[callback-type],Key=[cluster-name].[instance-name],Change=[callback-change-type]"
+
+|Attributes|Description|
+|----------|-----------|
+|Counter|Zk Callback counter for each Helix callback type.|
+|UnbatchedCounter|Unbatched Zk Callback counter for each helix callback type.|
+|LatencyCounter|Callback handler latency counter in ms.|
+|LatencyGauge|Histogram (with all statistic data) of Callback handler latency.|
+
+#### MBean MessageQueueMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],messageQueue=[instance-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|MessageQueueBacklog|Get the message queue size|
+
+### Metrics on Controller only
+
+#### MBean ClusterStatusMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|DisabledInstancesGauge|Current number of disabled instances|
+|DisabledPartitionsGauge|Current number of disabled partitions|
+|DownInstanceGauge|Current number of down instances|
+|InstanceMessageQueueBacklog|The sum of all message queue sizes for instances in this cluster|
+|InstancesGauge|Current number of live instances|
+|MaxMessageQueueSizeGauge|The maximum message queue size across all instances including controller|
+|RebalanceFailureGauge|Non-zero if the previous rebalance failed unexpectedly. The gauge is set every time a rebalance is done.|
+|RebalanceFailureCounter|The number of failures during the rebalance pipeline.|
+|Enabled|1 if cluster is enabled, otherwise 0|
+|Maintenance|1 if cluster is in maintenance mode, otherwise 0|
+|Paused|1 if cluster is paused, otherwise 0|
+
+#### MBean ClusterEventMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],eventName=ClusterEvent,phaseName=[event-handling-phase]"
+
+|Attributes|Description|
+|----------|-----------|
+|TotalDurationCounter|Total event process duration for each stage.|
+|MaxSingleDurationGauge|Max event process duration for each stage within the recent hour.|
+|EventCounter|The count of processed events in each stage.|
+|DurationGauge|Histogram (with all statistic data) of event process duration for each stage.|
+
+#### MBean InstanceMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],instanceName=[instance-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|Online|This instance is Online (1) or Offline (0)|
+|Enabled|This instance is Enabled (1) or Disabled (0)|
+|TotalMessageReceived|Number of messages sent to this instance by controller|
+|DisabledPartitions|Get the total disabled partitions number for this instance|
+
+#### MBean ResourceMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],resourceName=[resource-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|PartitionGauge|Get the number of partitions of the resource in the best possible ideal state for this resource|
+|ErrorPartitionGauge|Get the number of current partitions in ERROR state for this resource|
+|DifferenceWithIdealStateGauge|Get the number of replicas whose current state differs from the ideal state for this resource|
+|MissingTopStatePartitionGauge|Get the number of partitions that do not have a top state for this resource|
+|ExternalViewPartitionGauge|Get the number of partitions in ExternalView for this resource|
+|TotalMessageReceived|Get the number of messages sent to this resource by the controller|
+|LoadRebalanceThrottledPartitionGauge|Get the number of partitions that need load rebalance but were throttled.|
+|RecoveryRebalanceThrottledPartitionGauge|Get the number of partitions that need recovery rebalance but were throttled.|
+|PendingLoadRebalancePartitionGauge|Get the number of partitions that have pending load rebalance requests.|
+|PendingRecoveryRebalancePartitionGauge|Get the number of partitions that have pending recovery rebalance requests.|
+|MissingReplicaPartitionGauge|Get the number of partitions that have fewer replicas than expected.|
+|MissingMinActiveReplicaPartitionGauge|Get the number of partitions that have fewer replicas than the minimum requirement.|
+|MaxSinglePartitionTopStateHandoffDurationGauge|Get the max duration recorded when the top state is missing in any single partition.|
+|FailedTopStateHandoffCounter|Get the total number of failed top state transitions.|
+|SucceededTopStateHandoffCounter|Get the total number of top state transitions completed successfully.|
+|SuccessfulTopStateHandoffDurationCounter|Get the total duration of all top state transitions.|
+|PartitionTopStateHandoffDurationGauge|Histogram (with all statistic data) of top state transition duration.|
+
+#### MBean PerInstanceResourceMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],instanceName=[instance-name],resourceName=[resource-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|PartitionGauge|Get number of partitions of the resource in best possible ideal state for this resource on specific instance|
+
+#### MBean JobMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],jobType=[job-type]"
+
+|Attributes|Description|
+|----------|-----------|
+|SuccessfulJobCount|Get number of the succeeded jobs|
+|FailedJobCount|Get number of failed jobs|
+|AbortedJobCount|Get number of the aborted jobs|
+|ExistingJobGauge|Get number of existing jobs registered|
+|QueuedJobGauge|Get the number of queued jobs, which are not yet running|
+|RunningJobGauge|Get the number of running jobs|
+|MaximumJobLatencyGauge|Get the maximum latency of job running time. It is cleared every hour|
+|JobLatencyCount|Get total job latency counter.|
+
+#### MBean WorkflowMonitor
+ObjectName: "ClusterStatus:cluster=[cluster-name],workflowType=[workflow-type]"
+
+|Attributes|Description|
+|----------|-----------|
+|SuccessfulWorkflowCount|Get number of succeeded workflows|
+|FailedWorkflowCount|Get number of failed workflows|
+|FailedWorkflowGauge|Get number of current failed workflows|
+|ExistingWorkflowGauge|Get number of current existing workflows|
+|QueuedWorkflowGauge|Get number of queued but not started workflows|
+|RunningWorkflowGauge|Get number of running workflows|
+|WorkflowLatencyCount|Get workflow latency count|
+|MaximumWorkflowLatencyGauge|Get the maximum workflow latency. It is reset every hour.|
+
+### Metrics on Participant only
+
+#### MBean StateTransitionStatMonitor
+ObjectName: "CLMParticipantReport:Cluster=[cluster-name],Resource=[resource-name],Transition=[transaction-id]"
+
+|Attributes|Description|
+|----------|-----------|
+|TotalStateTransitionGauge|Get the number of total state transitions|
+|TotalFailedTransitionGauge|Get the number of total failed state transitions|
+|TotalSuccessTransitionGauge|Get the number of total succeeded state transitions|
+|MeanTransitionLatency|Get the average state transition latency (from message read to finish)|
+|MaxTransitionLatency|Get the maximum state transition latency|
+|MinTransitionLatency|Get the minimum state transition latency|
+|PercentileTransitionLatency|Get the percentile of state transitions latency|
+|MeanTransitionExecuteLatency|Get the average execution latency of state transition (from task started to finish)|
+|MaxTransitionExecuteLatency|Get the maximum execution latency of state transition|
+|MinTransitionExecuteLatency|Get the minimum execution latency of state transition|
+|PercentileTransitionExecuteLatency|Get the percentile of execution latency of state transitions|
+
+#### MBean ThreadPoolExecutorMonitor
+ObjectName: "HelixThreadPoolExecutor:Type=[threadpool-type]" (threadpool-type in Message.MessageType, BatchMessageExecutor, Task)
+
+|Attributes|Description|
+|----------|-----------|
+|ThreadPoolCoreSizeGauge|Thread pool size is as configured. Aggregate total thread pool size for the whole cluster.|
+|ThreadPoolMaxSizeGauge|Same as above|
+|NumOfActiveThreadsGauge|Number of running threads.|
+|QueueSizeGauge|Queue size. Can be used to identify whether too many HelixTasks are blocked in the participant.|
+
+#### MBean MessageLatencyMonitor
+ObjectName: "CLMParticipantReport:ParticipantName=[instance-name],MonitorType=MessageLatencyMonitor"
+
+|Attributes|Description|
+|----------|-----------|
+|TotalMessageCount|Total message count|
+|TotalMessageLatency|Total message latency in ms|
+|MessagelatencyGauge|Histogram (with all statistic data) of message processing latency.|
+
+#### MBean ParticipantMessageMonitor
+ObjectName: "CLMParticipantReport:ParticipantName=[instance-name]"
+
+|Attributes|Description|
+|----------|-----------|
+|ReceivedMessages|Number of received messages|
+|DiscardedMessages|Number of discarded messages|
+|CompletedMessages|Number of completed messages|
+|FailedMessages|Number of failed messages|
+|PendingMessages|Number of pending messages to be processed|
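
As an alternative to browsing with jconsole, the attributes listed in Metrics.md can be read programmatically through the standard `javax.management` API. The following is a minimal sketch, not Helix code: `HelixMetricsProbe`, `DemoQueueMonitor` and `sumAttribute` are hypothetical names, and the stand-in MBean is registered locally only so that the example runs without a Helix cluster. It assumes the ObjectName layout documented above for MessageQueueMonitor. Against a real instance one would most likely obtain an `MBeanServerConnection` from a remote JMX connector instead of the platform MBean server.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class HelixMetricsProbe {

  /** Hypothetical stand-in MBean so the sketch runs without a Helix cluster. */
  public interface DemoQueueMonitorMBean {
    long getMessageQueueBacklog();
  }

  public static class DemoQueueMonitor implements DemoQueueMonitorMBean {
    private final long backlog;

    public DemoQueueMonitor(long backlog) {
      this.backlog = backlog;
    }

    @Override
    public long getMessageQueueBacklog() {
      return backlog;
    }
  }

  /** Sums one numeric attribute over all MBeans whose name matches the pattern. */
  public static long sumAttribute(MBeanServer server, String pattern, String attribute)
      throws Exception {
    long total = 0;
    Set<ObjectName> names = server.queryNames(new ObjectName(pattern), null);
    for (ObjectName name : names) {
      Object value = server.getAttribute(name, attribute);
      if (value instanceof Number) {
        total += ((Number) value).longValue();
      }
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    // Assumed names, following the documented MessageQueueMonitor ObjectName layout.
    server.registerMBean(new DemoQueueMonitor(3),
        new ObjectName("ClusterStatus:cluster=DemoCluster,messageQueue=host_1"));
    server.registerMBean(new DemoQueueMonitor(5),
        new ObjectName("ClusterStatus:cluster=DemoCluster,messageQueue=host_2"));

    long backlog = sumAttribute(server,
        "ClusterStatus:cluster=DemoCluster,messageQueue=*", "MessageQueueBacklog");
    System.out.println("Total message queue backlog: " + backlog);
  }
}
```

The wildcard in the `messageQueue` key selects every instance of one cluster, so the same helper could be pointed at other patterns from the tables above.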

http://git-wip-us.apache.org/repos/asf/helix/blob/bb1ba896/website/0.8.1/src/site/markdown/index.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/index.md b/website/0.8.1/src/site/markdown/index.md
index 30920be..c736316 100644
--- a/website/0.8.1/src/site/markdown/index.md
+++ b/website/0.8.1/src/site/markdown/index.md
@@ -49,3 +49,6 @@ under the License.
 
 [Distributed task DAG execution](./recipes/task_dag_execution.html)
 
+### Operation
+
+[Monitoring Metrics](./Metrics.html)

http://git-wip-us.apache.org/repos/asf/helix/blob/bb1ba896/website/0.8.1/src/site/markdown/tutorial_task_framework.md
----------------------------------------------------------------------
diff --git a/website/0.8.1/src/site/markdown/tutorial_task_framework.md b/website/0.8.1/src/site/markdown/tutorial_task_framework.md
index 9659ada..d348544 100644
--- a/website/0.8.1/src/site/markdown/tutorial_task_framework.md
+++ b/website/0.8.1/src/site/markdown/tutorial_task_framework.md
@@ -63,6 +63,7 @@ TaskConfig taskConfig = new TaskConfig(null, null, null, null);
 ```
 
 For these four fields:
+
 * Command: The task command, will use Job command if this is null
 * ID: Task unique id, will generate a new ID for this task if input is null
 * TaskTargetPartition: Target partition of a target. Could be null
@@ -129,8 +130,10 @@ taskDriver.getJobContext("JOBNAME").getInfo();
 
 In common use, a one-time workflow is the default kind of workflow a user creates. The first step is to create a WorkflowConfig.Builder object with the workflow name. All configs can then be set on the WorkflowConfig.Builder. Once the configuration is done, a [WorkflowConfig](https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/WorkflowConfig.java) object can be obtained from the WorkflowConfig.Builder object.
 We have two rules to validate the Workflow configuration:
+
 * Expiry time should not be less than 0
 * Schedule config should be valid: either one-time or a positive interval magnitude (recurrent workflow)
+
 Example:
 
 ```
@@ -274,6 +277,7 @@ JobConfig myJobCfg = myJobCfgBuilder.build();
 ```
 
 Helix has a couple of rules to validate a job:
+
 * Each job must have at least one task to execute. For adding tasks and task rules, please refer to the following section, Add Tasks.
 * Task timeout should not be less than zero.
 * Number of concurrent tasks per instance should not be less than one.
@@ -283,6 +287,7 @@ Helix has couple rules to validate a job:
 #### Add Tasks
 
 There are two ways of adding tasks:
+
 * Add by TaskConfig. Tasks can be added by adding TaskConfigs. The user can create a List of TaskConfigs or add a TaskConfigMap, which is a task ID to TaskConfig mapping.
 
 ```
@@ -363,10 +368,12 @@ If any parent jobs are not finished, the job won't be scheduled even the schedul
 
 ### Monitor the status of your job
 The TaskDriver utility introduced in the Workflow section provides further functionality. The user can wait synchronously until a Job or Workflow reaches certain states. For this, Helix provides the APIs pollForJobState and pollForWorkflowState. pollForJobState accepts the following arguments:
+
 * Workflow name, required
 * Job name, required
 * Timeout, not required; defaults to three minutes if the user chooses the function without a timeout argument. The time unit is milliseconds.
 * TaskStates, at least one state. This function can accept multiple TaskStates and returns once any one of those TaskStates is reached.
+
 For example:
 
 ```
@@ -380,3 +387,9 @@ For pollForWorkflowState, it accepts similar arguments except Job name. For exam
 taskDriver.pollForWorkflowState("MyWorkflowName", 180000L, TaskState.FAILED, TaskState.FATAL_FAILED);
 taskDriver.pollForWorkflowState("MyWorkflowName", TaskState.COMPLETED);
 ```
+
+#### Task framework monitoring metrics
+Please refer to the following links.
+
+* [Job Monitor](./Metrics.html#MBean_JobMonitor)
+* [Workflow Monitor](./Metrics.html#MBean_WorkflowMonitor)

