metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From simonellistonball <...@git.apache.org>
Subject [GitHub] metron pull request #614: METRON-992: Create performance tuning guide
Date Thu, 08 Jun 2017 19:02:38 GMT
Github user simonellistonball commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120961066
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tunining Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies
for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 gbps load along with some guidance around
the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and
three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has it's own topology and there is also a highly
performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively
through using a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data
flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in
the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks from the number
of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically,
setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might
set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring
down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum
value equal to the number of tasks.
    +
    +We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets
worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions
used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers
to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition
that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe
what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the
new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this
is no longer true for the offsets which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper
if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness as
well as throughput performance.
    +
    +First we need to setup some environment variables
    +```
    +export BROKERLIST=<your broker comma-delimated list of host:ports>
    +export ZOOKEEPER=<your zookeeper comma-delimated list of host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, setup the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example
we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions
and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET
 LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067
       1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326
       1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332
       1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer
groups only exist while consumers in the
    +spouts are up and running.
    +
    +The primary column we're concerned with paying attention to is the LAG column, which
is the current delta calculation between the
    +current and end offset for the partition. This tells us how close we are to keeping up
with incoming data. And, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and
lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information.
The most useful bit is that the
    +watch command will highlight the differences from the current output and the last output
screens.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka
topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    +
    +Kafka
    +    - # partitions
    +Storm
    +    Kafka
    +        - polling frequency and timeouts
    +    - # workers
    +    - ackers
    +    - max spout pending
    +    - spout parallelism
    +    - bolt parallelism
    +    - # executors
    +Metron
    +    - bolt cache size - handles how many messages can be cached. This cache is used while
waiting for all parts of the message to be rejoined.
    +
    +## Topologies
    +
    +### Parsers
    +
    +The parsers and PCAP use a builder utility, as opposed to enrichments and indexing, which
use Flux.
    +
    +We set the number of partitions for our inbound Kafka topics to 48.
    +
    +```
    +$ cat ~metron/.storm/storm-bro.config
    +
    +{
    +    ...
    +    "topology.max.spout.pending" : 2000
    +    ...
    +}
    +```
    +
    +These are the spout recommended defaults from Storm and are currently the defaults provided
in the Kafka spout itself.
    +In fact, if you find the recommended defaults work fine for you, then this file might
not be necessary at all.
    +```
    +$ cat ~/.storm/spout-bro.config
    +{
    +    ...
    +    "spout.pollTimeoutMs" : 200,
    +    "spout.maxUncommittedOffsets" : 10000000,
    +    "spout.offsetCommitPeriodMs" : 30000
    +}
    +```
    +
    +We ran our bro parser topology with the following options
    --- End diff --
    
    Is there any explanation of how we got to these numbers? There will be heavily dependent
on hardware, particularly cpu and disk configuration, so the context could be very important.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message