So are you suggesting that the long delays in the slowest 1% of events
happen in the slower partitions that are further away? Thanks.
On Wed, Sep 9, 2015 at 3:15 PM, Helleren, Erik <Erik.Helleren@cmegroup.com>
wrote:
> So, I did my own latency test on a cluster of 3 nodes, and there is a
> significant difference around the 99%’ile and higher between partitions when
> measuring the ack time when configured for a single ack. The graph
> that I wish I could attach or post clearly shows that around 1/3 of the
> partitions diverge significantly from the other two thirds. So, at least in
> my case, one of my brokers is further away than the others.
> Erik
>
> On 9/4/15, 1:06 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
>
> >No problem. Thanks for your advice. I think it would be fun to explore. I
> >only know how to program in java though. Hope it will work.
> >
> >On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
> ><Erik.Helleren@cmegroup.com>
> >wrote:
> >
> >> I think the suggestion is to have partitions/brokers >= 1, so 32 should be
> >> enough.
> >>
> >> As for latency tests, there isn’t a lot of code to do a latency test. If
> >> you just want to measure ack time, it’s around 100 lines. I will try to
> >> push out some good latency testing code to github, but my company is
> >> scared of open sourcing code… so it might be a while…
> >> Erik
> >>
> >>
> >> On 9/4/15, 12:55 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
> >>
> >> >Thanks for your reply Erik. I am running some more tests according to your
> >> >suggestions now and I will share my results here. Is it necessary to
> >> >use a fixed number of partitions (32 partitions maybe) for my test?
> >> >
> >> >I am testing the 2, 4, 8, 16 and 32 broker scenarios, all of them
> >> >running on individual physical nodes. So I think using at least 32
> >> >partitions will make more sense? I have seen latencies increase as the
> >> >number of partitions goes up in my experiments.
> >> >
> >> >To get the latency of each event recorded, are you suggesting that I
> >> >write my own test program (in Java perhaps), or can I just modify the
> >> >standard test program provided by Kafka (
> >> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need to
> >> >rebuild the source if I modify the standard Java test program
> >> >ProducerPerformance provided in Kafka, right? Right now this standard
> >> >program only reports average latencies and percentile latencies but no
> >> >per-event latencies.
> >> >
> >> >Thanks.
> >> >
> >> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
> >> ><Erik.Helleren@cmegroup.com>
> >> >wrote:
> >> >
> >> >> That is an excellent question! There are a bunch of ways to monitor
> >> >> jitter and see when that is happening. Here are a few:
> >> >>
> >> >> - You could slice the histogram every few seconds, save it out with a
> >> >> timestamp, and then look at how the slices compare. This would be mostly
> >> >> manual, or you can graph line charts of the percentiles over time in
> >> >> Excel, where each percentile would be a series. If you are using
> >> >> HdrHistogram, you should look at how to use the Recorder class to do
> >> >> this, coupled with a ScheduledExecutorService.
> >> >>
> >> >> - You can just save the starting timestamp and the latency of each
> >> >> event. If you put them into a CSV, you can load it into Excel and graph
> >> >> it as an XY chart. That way you can see every point during the run of
> >> >> your program and you can see trends. Be careful with this one,
> >> >> especially about writing to a file in the callback that Kafka provides.
> >> >>
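The second approach above could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: `LatencyCsvRecorder` and its methods are hypothetical names, the timestamps are assumed to come from something like `System.nanoTime()`, and the Kafka callback wiring is left out. The point is that the callback only appends to an in-memory queue, and the CSV is rendered after the run.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper (not part of Kafka): buffers one sample per acknowledged
// send in memory, so the producer callback itself never touches a file.
final class LatencyCsvRecorder {
    // One observation: when the send started and how long the ack took.
    private static final class Sample {
        final long startNanos;
        final long latencyNanos;

        Sample(long startNanos, long latencyNanos) {
            this.startNanos = startNanos;
            this.latencyNanos = latencyNanos;
        }
    }

    // Thread-safe and non-blocking, so it is cheap to call from a callback thread.
    // Note: this grows without bound, so size the heap for the number of events.
    private final ConcurrentLinkedQueue<Sample> samples = new ConcurrentLinkedQueue<>();

    // Call from the ack callback with the send start time and the ack time.
    void record(long startNanos, long ackNanos) {
        samples.add(new Sample(startNanos, ackNanos - startNanos));
    }

    // Call once after the run has finished; renders every sample as a CSV row.
    String toCsv() {
        StringBuilder sb = new StringBuilder("start_ns,latency_ns\n");
        for (Sample s : samples) {
            sb.append(s.startNanos).append(',').append(s.latencyNanos).append('\n');
        }
        return sb.toString();
    }
}
```

The resulting text can be written to a file in one go at the end of the test and loaded into a spreadsheet as an XY chart.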
> >> >> Also, I have noticed that most of the very slow observations are at
> >> >> startup. But don’t trust me, trust the data and share your findings.
> >> >> Also, the 99.9th percentile provides a pretty good standard for typical
> >> >> poor-case performance. The average is borderline useless; the 50%’ile is
> >> >> a better typical case because that’s the number that says “half of
> >> >> events will be this slow or faster”, and for high values like the
> >> >> 99.9%’ile, “0.1% of all events will be slower than this”.
> >> >> Erik
> >> >>
> >> >> On 9/4/15, 12:05 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
> >> >>
> >> >> >Thank you Erik! That is helpful!
> >> >> >
> >> >> >But I also see jitter in the maximum latencies when running the
> >> >> >experiment.
> >> >> >
> >> >> >The average end-to-acknowledgement latency from producer to broker is
> >> >> >around 5 ms when using 92 producers and 4 brokers, and the 99.9th
> >> >> >percentile latency is 58 ms, but the maximum latency goes up to 1359 ms.
> >> >> >How do I locate the source of this jitter?
> >> >> >
> >> >> >Thanks.
> >> >> >
> >> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> >> >> ><Erik.Helleren@cmegroup.com>
> >> >> >wrote:
> >> >> >
> >> >> >> Well… not to be contrarian, but latency depends much more on the
> >> >> >> latency between the producer and the broker that is the leader for the
> >> >> >> partition you are publishing to. At least when your brokers are not
> >> >> >> saturated with messages, and acks are set to 1. If acks are set to ALL,
> >> >> >> latency on a non-saturated Kafka cluster will be: round-trip latency
> >> >> >> from the producer to the partition leader + max(round-trip latency to
> >> >> >> each replica of that partition). If a cluster is saturated with
> >> >> >> messages, we have to assume that all partitions receive an equal
> >> >> >> distribution of messages to avoid linear algebra and queueing theory
> >> >> >> models. I don’t like linear algebra :P
> >> >> >>
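As a rough illustration of the arithmetic in the paragraph above, here is a minimal sketch. It encodes only the simplified model stated there for a non-saturated cluster; `AckLatencyModel` and its method names are hypothetical, and the round-trip times are assumed inputs that you would have to measure yourself.

```java
import java.util.Arrays;

// Hypothetical model of the simplified formula above; not a Kafka API.
final class AckLatencyModel {
    // acks=1: only the producer <-> leader round trip counts.
    static double acksOne(double producerToLeaderRttMs) {
        return producerToLeaderRttMs;
    }

    // acks=all: the leader round trip plus the slowest replica round trip.
    // With no replicas listed, this falls back to the leader round trip alone.
    static double acksAll(double producerToLeaderRttMs, double... replicaRttMs) {
        double slowestReplica = Arrays.stream(replicaRttMs).max().orElse(0.0);
        return producerToLeaderRttMs + slowestReplica;
    }
}
```

For example, a 1.0 ms round trip to the leader and replica round trips of 0.5 ms and 2.0 ms give 3.0 ms under this model, since only the slowest replica matters.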
> >> >> >> Since you are probably putting all your latencies into a single
> >> >> >> histogram per producer, or worse, just an average, this pattern would
> >> >> >> have been obscured. Obligatory lecture about measuring latency by Gil
> >> >> >> Tene (https://www.youtube.com/watch?v=9MKY4KypBzg). To verify this
> >> >> >> hypothesis, you should rewrite the benchmark to record the latencies of
> >> >> >> each write to a partition, for each producer, in a histogram
> >> >> >> (HdrHistogram is pretty good for that). This would give you
> >> >> >> producers*partitions histograms, which might be unwieldy for that many
> >> >> >> producers. But wait, there is hope!
> >> >> >>
> >> >> >> To verify that this hypothesis holds, you just have to see that there
> >> >> >> is a significant difference between different partitions on a SINGLE
> >> >> >> producing client. So, pick one producing client at random and use the
> >> >> >> data from that. The easy way to do that is to plot all the partition
> >> >> >> latency histograms on top of each other in the same plot; that way you
> >> >> >> have a pretty plot to show people. If you don’t want to set up
> >> >> >> plotting, you can just compare the medians (50th percentile) of the
> >> >> >> partitions’ histograms. If there is a lot of variance, your latency
> >> >> >> anomaly is explained by brokers 4-7 being slower than brokers 0-3! If
> >> >> >> there isn’t a lot of variance at 50%, look at higher percentiles. And
> >> >> >> if the higher percentiles for all the partitions look the same, this
> >> >> >> hypothesis is disproved.
> >> >> >>
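The per-partition comparison described above can be sketched without any plotting. This is a minimal illustration using plain arrays and a nearest-rank percentile rather than HdrHistogram; `PartitionPercentiles` and its methods are hypothetical names, and the latencies are assumed to have been collected per partition on a single producing client.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: compares one percentile across partitions.
final class PartitionPercentiles {
    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it. Requires a non-empty array.
    static long percentile(long[] latencies, double p) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Returns partition id -> percentile value, ordered by partition id,
    // so that diverging partitions are easy to spot side by side.
    static Map<Integer, Long> perPartition(Map<Integer, long[]> byPartition, double p) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, long[]> e : byPartition.entrySet()) {
            result.put(e.getKey(), percentile(e.getValue(), p));
        }
        return result;
    }
}
```

Calling `perPartition(data, 50)` first and then `perPartition(data, 99)` follows the order suggested above: compare medians, and only look at higher percentiles if the medians agree.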
> >> >> >> If you want to make a general statement about the latency of writing
> >> >> >> to Kafka, you can merge all the histograms into a single histogram and
> >> >> >> plot that.
> >> >> >>
> >> >> >> To Yuheng’s credit, more brokers always results in more throughput.
> >> >> >> But throughput and latency are two different creatures. It’s worth
> >> >> >> noting that Kafka is designed to be high throughput first and low
> >> >> >> latency second. And it does a really good job at both.
> >> >> >>
> >> >> >> Disclaimer: I might not like linear algebra, but I do like statistics.
> >> >> >> Let me know if there are topics above that need more explanation and
> >> >> >> that aren’t covered by Gil’s lecture.
> >> >> >> Erik
> >> >> >>
> >> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
> >> >> >>
> >> >> >> >When I use 32 partitions, the 4-broker latency becomes larger than
> >> >> >> >the 8-broker latency.
> >> >> >> >
> >> >> >> >So is it always true that using more brokers gives lower latency when
> >> >> >> >the number of partitions is at least the number of brokers?
> >> >> >> >
> >> >> >> >Thanks.
> >> >> >> >
> >> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
> >> >><yuheng.du.hust@gmail.com>
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> I am running a producer latency test. When using 92 producers on 92
> >> >> >> >> physical nodes publishing to 4 brokers, the latency is slightly lower
> >> >> >> >> than when using 8 brokers. I am using 8 partitions for the topic.
> >> >> >> >>
> >> >> >> >> I have rerun the test and it gives me the same result: the 4-broker
> >> >> >> >> scenario still has lower latency than the 8-broker scenario.
> >> >> >> >>
> >> >> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
> >> >> >> >> brokers, 16 brokers and 32 brokers. For the rest of the cases the
> >> >> >> >> latency decreases as the number of brokers increases.
> >> >> >> >>
> >> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this rule.
> >> >> >> >> What could be the cause?
> >> >> >> >>
> >> >> >> >> I am using a 200-byte message; the test lets each producer publish
> >> >> >> >> 500k messages to a given topic. For every test run where I change the
> >> >> >> >> number of brokers, I use a new topic.
> >> >> >> >>
> >> >> >> >> Thanks for any advice.
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
>
>
