kafka-users mailing list archives

From Guozhang Wang <wangg...@gmail.com>
Subject Re: Need some help in identifying some important metrics to monitor for streams
Date Fri, 03 Mar 2017 19:06:28 GMT
Sachin,

The reason your metrics name shows up as

new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1

is that you did not set "CLIENT_ID_CONFIG" in your app, so KafkaStreams
has to fall back to a default clientId of the form "appID-processID": here
the appID is "new-part-advice" and the processID is a UUID that guarantees
uniqueness across machines.


As for the metricsName, it is always set to clientId + "-" + threadName.
Here "StreamThread-1" is your threadName, which is unique only WITHIN the
JVM; that is why we still need the globally unique clientId to tell threads
on different machines apart.
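
To make the naming rule concrete, here is a minimal sketch in plain Java
that only mirrors the string concatenation described in this mail. The
class and method names are hypothetical and are not part of the Kafka
Streams API. In a real app the explicit clientId would come from setting
StreamsConfig.CLIENT_ID_CONFIG (the "client.id" property) in the streams
configuration.

```java
import java.util.UUID;

// Hypothetical helper, not Kafka code: it only reproduces the naming rule
// described in this mail.
final class StreamsMetricNames {

    // Default clientId used when "client.id" is not set: appID + "-" + processID.
    static String defaultClientId(String applicationId, UUID processId) {
        return applicationId + "-" + processId;
    }

    // metricsName = clientId + "-" + threadName.
    static String threadMetricsName(String clientId, String threadName) {
        return clientId + "-" + threadName;
    }

    public static void main(String[] args) {
        // The UUID from the thread is reused here; a real run generates a fresh one.
        UUID processId = UUID.fromString("d1094e71-0f59-45e8-98f4-477f9444aa91");

        // No client.id configured: the name contains the generated UUID.
        System.out.println(threadMetricsName(
                defaultClientId("new-part-advice", processId), "StreamThread-1"));

        // client.id explicitly set to "new-advice-1": the name is stable.
        System.out.println(threadMetricsName("new-advice-1", "StreamThread-1"));
    }
}
```

With an explicit clientId the name stays the same across restarts, which
is what a cron-driven script needs.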

I just checked the source code and this logic did not change between 0.10.1
and 0.10.2, so I guess you explicitly set your clientId to "new-advice-1"
in 0.10.1?
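
If the script should keep working without a fixed clientId, one option is
to look the MBeans up with a JMX wildcard instead of a hard-coded name. The
sketch below uses only the standard javax.management API. The class name is
hypothetical, the attribute name is the one from your 0.10.2.0 output, and
it assumes the code runs inside the same JVM as the streams app (a separate
cron process would need a remote JMX connection instead).

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper: finds the RocksDB window-store MBeans for any client-id.
final class StreamsMBeanLookup {

    // "client-id=*" matches every clientId/thread combination, UUID or not.
    static final String PATTERN =
            "kafka.streams:type=stream-rocksdb-window-metrics,client-id=*";

    static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(name, e);
        }
    }

    // True if the concrete MBean name is covered by the wildcard pattern.
    static boolean matches(String mbeanName) {
        return parse(PATTERN).apply(parse(mbeanName));
    }

    static Set<ObjectName> findRocksDbWindowMBeans(MBeanServer server) {
        return server.queryNames(parse(PATTERN), null);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Prints nothing unless a streams app is registered in this JVM.
        for (ObjectName name : findRocksDbWindowMBeans(server)) {
            System.out.println(name + " = "
                    + server.getAttribute(name, "key-table-put-latency-avg"));
        }
    }
}
```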


Guozhang



On Fri, Mar 3, 2017 at 4:02 AM, Eno Thereska <eno.thereska@gmail.com> wrote:

> Hi Sachin,
>
> Now that the Confluent Platform 3.2 is out, we also have some more
> documentation on this here:
> http://docs.confluent.io/3.2.0/streams/monitoring.html. We added a note on
> how to add other metrics.
>
> Yeah, your calculation on poll time makes sense. The important metrics are
> the “info” ones that are on by default. However, for stateful applications,
> if you suspect that state stores might be a bottleneck, you might want to
> collect those metrics too.
>
> On the benchmarks, the ones called “processstreamwithstatestore” and
> “count” are the closest to a benchmark of RocksDb with the default
> configs. The first writes each record to RocksDb, while the second performs
> simple aggregates (reads and writes from/to RocksDb).
>
> We might need to add more benchmarks here, would be great to get some
> ideas and help from the community. E.g., a pure RocksDb benchmark that
> doesn’t go through streams at all.
>
> Could you open a JIRA on the name issue please? As an “improvement”.
>
> Thanks
> Eno
>
>
>
> > On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sjmittal@gmail.com> wrote:
> >
> > Hi,
> > I had checked the monitoring docs, but could not figure out which metrics
> > are important ones.
> >
> > Also mainly I am looking at the average time spent between 2 successive
> > poll requests.
> > Can I say that average time between 2 poll requests is sum of
> >
> > commit + poll + process + punctuate (latency-avg).
> >
> >
> > Also I checked the benchmark tests results but could not find any
> > information on rocksdb metrics for fetch and put operations.
> > Is there any benchmark for these, or can something be said about its
> > performance based on the values in my previous mail?
> >
> >
> > Lastly can we get some help on names like
> > new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 and
> > have a more standard thread name like new-advice-1-StreamThread-1 (as in
> > version 0.10.1.1) so we can log these metrics as part of our cron jobs.
> >
> > Thanks
> > Sachin
> >
> >
> >
> > On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <eno.thereska@gmail.com> wrote:
> >
> >> Hi Sachin,
> >>
> >> The new streams metrics are now documented at
> >> https://kafka.apache.org/documentation/#kafka_streams_monitoring. Note
> >> that not all of them are turned on by default.
> >>
> >> We have several benchmarks that run nightly to monitor streams
> >> performance. They all stem from the SimpleBenchmark.java benchmark. In
> >> addition, their results are published nightly at
> >> http://testing.confluent.io (e.g., under the trunk results). For example,
> >> in today's results:
> >> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
> >> (if you search for "benchmarks.streams") you'll see results from a series
> >> of benchmarks, ranging from simply consuming, to simple topologies with a
> >> source and sink, to joins and count aggregate. These run on AWS nightly,
> >> but you can also run them manually on your setup.
> >>
> >> In addition, programmatically the code can check the KafkaStreams.state()
> >> and register listeners for when the state changes. For example, the state
> >> can change from "running" to "rebalancing".
> >>
> >> It is likely we'll need more metrics moving forward, and it would be great
> >> to get feedback from the community.
> >>
> >>
> >> Thanks
> >> Eno
> >>
> >>
> >>
> >>
> >>> On 2 Mar 2017, at 11:54, Sachin Mittal <sjmittal@gmail.com> wrote:
> >>>
> >>> Hello All,
> >>> I had a few questions regarding monitoring of a kafka streams application
> >>> and which important metrics we should collect in our case.
> >>>
> >>> Just a brief overview: we have a single-threaded application (0.10.1.1)
> >>> reading from a single-partition topic, and it is working fine.
> >>> Then we have the same application (using 0.10.2.0), multi-threaded with 4
> >>> threads per machine on a 3-machine cluster, reading from the same topic,
> >>> now partitioned (12 partitions).
> >>> Thus each thread processes a single partition, the same case as the
> >>> earlier one.
> >>>
> >>> The new setup also works fine in steady state, but under load it somehow
> >>> triggers frequent rebalances, and then we run into all sorts of issues,
> >>> like stream threads dying due to CommitFailedException or entering a
> >>> deadlock state.
> >>> After a while we restart all the instances; then it works fine for a
> >>> while, we get the same problem again, and so it goes on.
> >>>
> >>> 1. So just to monitor, e.g. when the first thread fails, what would be
> >>> some important metrics we should be collecting to get some sense of what's
> >>> going on?
> >>>
> >>> 2. Is there any metric that tells time elapsed between successive poll
> >>> requests, so we can monitor that?
> >>>
> >>> Also I did monitor rocksdb put and fetch times for these 2 instances and
> >>> here is the output I get:
> >>> 0.10.1.1
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> >>> 206431.7497615029
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> >>> 2595394.2746129474
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> >>> 232.86299499317252
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> >>> 373.61071016166284
> >>>
> >>> Same values for 0.10.2.0 I get
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> >>> 1199859.5535022356
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> >>> 3679340.80748852
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> >>> 56.134778706069184
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> >>> 136.10721427931827
> >>>
> >>> I notice that the results in 0.10.2.0 are much worse than those for
> >>> 0.10.1.1.
> >>>
> >>> I would like to know
> >>> 1. Is there any benchmark on rocksdb showing at what rate/latency it
> >>> should be doing put/fetch operations?
> >>>
> >>> 2. What could be the cause of the inferior numbers in 0.10.2.0? Is it
> >>> because this application is also running three other threads doing the
> >>> same thing?
> >>>
> >>> 3. Also, what's with the name
> >>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1?
> >>>   I wanted to put this in my cronjob, so why can't we have a simpler
> >>> name like we have in 0.10.1.1, so it is easy to write the script?
> >>>
> >>> Thanks
> >>> Sachin
> >>
> >>
>
>


-- 
-- Guozhang
