So are you suggesting that the long delays in the slowest 1% of requests
happen on the slower partitions that are further away? Thanks.
So, I did my own latency test on a cluster of 3 nodes, and there is a
significant difference around the 99%'ile and higher for partitions when
measuring the ack time when configured for a single ack. The graph
that I wish I could attach or post clearly shows that around 1/3 of the
partitions significantly diverge from the other two. So, at least in my
case, one of my brokers is further than the others.
Erik
On 9/4/15, 1:06 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
No problem. Thanks for your advice. I think it would be fun to explore. I
only know how to program in Java though. I hope it will work.
> >On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik
> ><Erik.Helleren@cmegroup.com>
> >
> >> I think the suggestion is to have partitions/brokers >= 1, so 32 should be
> >> enough.
> >>
> >> As for latency tests, there isn’t a lot of code to do a latency test. If
> >> you just want to measure ack time, it's around 100 lines. I will try to
> >> push out some good latency testing code to GitHub, but my company is
> >> scared of open sourcing code… so it might be a while…
> >> Erik
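A minimal sketch of the ack-time measurement described above, in Java. It is not Erik's code: `AckTimer` and `timedSend` are hypothetical names, and the `sender` function is a stand-in for whatever asynchronous send call the real producer exposes, so the block has no Kafka dependency. The only idea shown is taking one timestamp before the send and another when the acknowledgement arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Records the time from send to acknowledgement for each message (hypothetical helper). */
class AckTimer {
    private final List<Long> latenciesNanos = new ArrayList<>();

    /**
     * Sends one message through the given asynchronous sender (a stand-in for
     * the real producer) and records the elapsed time once the ack arrives.
     */
    <T> CompletableFuture<Void> timedSend(T message, Function<T, CompletableFuture<?>> sender) {
        final long start = System.nanoTime();
        return sender.apply(message).thenRun(() -> record(System.nanoTime() - start));
    }

    // The ack may arrive on another thread, so guard the list.
    private synchronized void record(long nanos) {
        latenciesNanos.add(nanos);
    }

    /** Returns a copy of all latencies recorded so far, in nanoseconds. */
    synchronized List<Long> snapshot() {
        return new ArrayList<>(latenciesNanos);
    }
}
```

With the Kafka Java producer, the same two timestamps would be taken around `producer.send(record, callback)`, with the second one read inside the callback.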
On 9/4/15, 12:55 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
> >>
> >> >Thanks for your reply Erik. I am running some more tests according to your
> >> >suggestions now and I will share my results here. Is it necessary to
> >> >use a fixed number of partitions (32 partitions maybe) for my test?
> >> >
> >> >I am testing 2, 4, 8, 16 and 32 broker scenarios, all of them
> >> >running on individual physical nodes. So I think using at least 32 partitions
> >> >will make more sense? I have seen latencies increase as the number of partitions
> >> >goes up in my experiments.
> >> >
> >> >To get the latency of each event recorded, are you suggesting that I
> >> >write my own test program (in Java perhaps), or can I just modify the
> >> >standard test program provided by Kafka (
> >> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need to
> >> >rebuild the source if I modify the standard Java test program
> >> >ProducerPerformance provided in Kafka, right? Right now this standard program
> >> >only reports average latencies and percentile latencies, but no per-event
> >> >latencies.
> >> >
> >> >Thanks.
> >> >
> >> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik
> >> ><Erik.Helleren@cmegroup.com>
> >> >wrote:
> >> >
> >> >> That is an excellent question! There are a bunch of ways to monitor
> >> >> jitter and see when that is happening. Here are a few:
> >> >>
> >> >>  You could slice the histogram every few seconds, save it out with a
> >> >> timestamp, and then look at how the slices compare. This would be mostly
> >> >> manual, or you can graph line charts of the percentiles over time in Excel
> >> >> where each percentile would be a series. If you are using HdrHistogram,
> >> >> you should look at how to use the Recorder class to do this coupled with a
> >> >> ScheduledExecutorService.
> >> >>
> >> >>  You can just save the starting timestamp of the event and the latency of
> >> >> each event. If you put it into a CSV, you can just load it up into Excel
> >> >> and graph it as an XY chart. That way you can see every point during the
> >> >> running of your program and you can see trends. You want to be careful
> >> >> about this one, especially about writing to a file in the callback that Kafka
> >> >> provides.
> >> >>
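A sketch of the second approach that keeps file I/O out of the callback, assuming Java. The callback only pushes a (start timestamp, latency) pair onto an in-memory queue, and a separate thread writes the CSV rows. `LatencyCsvLogger`, its method names, and the column names are made up for illustration, and the unbounded queue is a simplification that assumes the writer keeps up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Writes one CSV row per event on a background thread (hypothetical helper). */
class LatencyCsvLogger implements AutoCloseable {
    // Sentinel row that tells the writer thread to stop.
    private static final long[] STOP = new long[0];
    private final BlockingQueue<long[]> queue = new LinkedBlockingQueue<>();
    private final Thread writerThread;

    LatencyCsvLogger(Writer out) {
        writerThread = new Thread(() -> drain(out), "latency-csv-writer");
        writerThread.start();
    }

    /** Safe to call from the send callback: it only enqueues, no I/O happens here. */
    void record(long startMillis, long latencyMicros) {
        queue.add(new long[] {startMillis, latencyMicros});
    }

    // Runs on the writer thread: header first, then rows until the sentinel arrives.
    private void drain(Writer out) {
        try {
            out.write("start_ms,latency_us\n");
            for (long[] row = queue.take(); row != STOP; row = queue.take()) {
                out.write(row[0] + "," + row[1] + "\n");
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Flushes all queued rows and waits for the writer thread to finish. */
    @Override
    public void close() {
        queue.add(STOP);
        try {
            writerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real run the `Writer` would typically be a `BufferedWriter` over a file, and `close()` would be called once all sends have been acknowledged.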
> >> >> Also, I have noticed that most of the very slow observations are at
> >> >> startup. But don't trust me, trust the data and share your findings.
> >> >> Also, having a 99.9 percentile provides a pretty good standard for typical
> >> >> poor case performance. Average is borderline useless; the 50%'ile is a better
> >> >> typical case because that's the number that says "half of events will be
> >> >> this slow or faster", or for values that are high like the 99.9%'ile, "0.1% of
> >> >> all events will be slower than this".
> >> >> Erik
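To make the percentile reading above concrete, here is a small Java sketch using the nearest-rank definition. The `Percentiles` class is hypothetical, and a real benchmark would more likely use a histogram library than sort raw samples.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over raw latency samples (illustrative only). */
class Percentiles {
    /** Smallest sample such that at least p percent of all samples are <= it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean, shown only to contrast it with the median. */
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }
}
```

For the made-up samples `{5, 5, 5, 5, 5, 5, 5, 5, 58, 1359}` (milliseconds), `percentile(samples, 50)` is 5 while `mean(samples)` is 145.7. A single outlier drags the average far from what a typical event sees, which is why the median is the better typical-case number.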
> >> >>
On 9/4/15, 12:05 PM, "Yuheng Du" <yuheng.du.hust@gmail.com> wrote:
> >> >>
> >> >> >Thank you Erik! That is helpful!
> >> >> >
> >> >> >But I also see jitter in the maximum latencies when running the
> >> >> >experiment.
> >> >> >
> >> >> >The average end to acknowledgement latency from producer to broker is
> >> >> >around 5ms when using 92 producers and 4 brokers, and the 99.9 percentile
> >> >> >latency is 58ms, but the maximum latency goes up to 1359 ms. How can I locate
> >> >> >the source of this jitter?
> >> >> >
> >> >> >Thanks.
> >> >> >
> >> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik
> >> >> ><Erik.Helleren@cmegroup.com>
> >> >> >wrote:
> >> >> >
> >> >> >> Well… not to be contrarian, but latency depends much more on the latency
> >> >> >> between the producer and the broker that is the leader for the partition
> >> >> >> you are publishing to, at least when your brokers are not saturated with
> >> >> >> messages and acks is set to 1. If acks is set to ALL, latency on a
> >> >> >> non-saturated Kafka cluster will be: round-trip latency from producer to
> >> >> >> leader for the partition + max(round-trip latency to each replica of
> >> >> >> that partition). If a cluster is saturated with messages, we have to
> >> >> >> assume that all partitions receive an equal distribution of messages to
> >> >> >> avoid linear algebra and queueing theory models. I don't like linear
> >> >> >> algebra :P
> >> >> >>
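The two cases above can be written down as a tiny Java model. This is only a restatement of the formula in the message for an unsaturated cluster: the class and method names are hypothetical, and reading the replica term as a leader-to-replica round trip is an assumption.

```java
import java.util.Arrays;

/** Back-of-the-envelope ack latency for an unsaturated cluster (illustrative only). */
class AckLatencyModel {
    /** acks=1: only the producer-to-leader round trip matters. */
    static double acksOne(double producerToLeaderRttMs) {
        return producerToLeaderRttMs;
    }

    /** acks=all: the leader round trip plus the slowest replica round trip. */
    static double acksAll(double producerToLeaderRttMs, double... replicaRttMs) {
        double slowestReplica = Arrays.stream(replicaRttMs).max().orElse(0.0);
        return producerToLeaderRttMs + slowestReplica;
    }
}
```

For example, `acksAll(2.0, 1.0, 3.0)` gives 5.0: the slowest replica sets the cost, so one distant broker is enough to raise latency for every partition it replicates.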
> >> >> >> Since you are probably putting all your latencies into a single histogram
> >> >> >> per producer, or worse, just an average, this pattern would have been
> >> >> >> obscured. Obligatory lecture about measuring latency by Gil Tene
> >> >> >> (https://www.youtube.com/watch?v=9MKY4KypBzg). To verify this hypothesis,
> >> >> >> you should rewrite the benchmark to record the latency of each write to
> >> >> >> a partition, for each producer, into its own histogram (HdrHistogram is
> >> >> >> pretty good for that). This would give you producers*partitions histograms,
> >> >> >> which might be unwieldy for that many producers. But wait, there is hope!
> >> >> >>
> >> >> >> To verify that this hypothesis holds, you just have to see that there is a
> >> >> >> significant difference between different partitions on a SINGLE producing
> >> >> >> client. So, pick one producing client at random and use the data from
> >> >> >> that. The easy way to do that is to plot all the partition latency
> >> >> >> histograms on top of each other in the same plot; that way you have a
> >> >> >> pretty plot to show people. If you don't want to set up plotting, you can
> >> >> >> just compare the medians (50th percentile) of the partitions' histograms.
> >> >> >> If there is a lot of variance, your latency anomaly is explained by
> >> >> >> brokers 4-7 being slower than nodes 0-3! If there isn't a lot of variance
> >> >> >> at 50%, look at higher percentiles. And if higher percentiles for all the
> >> >> >> partitions look the same, this hypothesis is disproved.
> >> >> >>
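A minimal Java sketch of the median comparison described above, for a single producing client. To stay dependency-free it keeps raw samples in sorted lists rather than using HdrHistogram, so it is a stand-in for the histogram approach, not a copy of it. `PartitionLatencies` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups ack latencies by partition so their medians can be compared (illustrative only). */
class PartitionLatencies {
    private final Map<Integer, List<Long>> byPartition = new TreeMap<>();

    /** Records one acknowledged write; may be called from the send callback. */
    synchronized void record(int partition, long latencyMicros) {
        byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(latencyMicros);
    }

    /** Median (lower median for even counts) latency per partition, ordered by partition id. */
    synchronized Map<Integer, Long> medians() {
        Map<Integer, Long> result = new TreeMap<>();
        byPartition.forEach((partition, samples) -> {
            List<Long> sorted = new ArrayList<>(samples);
            Collections.sort(sorted);
            result.put(partition, sorted.get((sorted.size() - 1) / 2));
        });
        return result;
    }
}
```

If the medians cluster into clearly separate groups, that supports the idea that some leaders are further from this producer than others. If they look alike, the same comparison can be repeated at a higher percentile.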
> >> >> >> If you want to make a general statement about latency of writing to Kafka,
> >> >> >> you can merge all the histograms into a single histogram and plot that.
> >> >> >>
> >> >> >> To Yuheng's credit, more brokers always results in more throughput. But
> >> >> >> throughput and latency are two different creatures. It's worth noting that
> >> >> >> Kafka is designed to be high throughput first and low latency second. And
> >> >> >> it does a really good job at both.
> >> >> >>
> >> >> >> Disclaimer: I might not like linear algebra, but I do like statistics.
> >> >> >> Let me know if there are topics that need more explanation above that
> >> >> >> aren't covered by Gil's lecture.
> >> >> >> Erik
> >> >> >>
On 9/4/15, 9:03 AM, "Yuheng Du" <yuheng.du.hust@gmail.com>
wrote:
> >> >> >>
> >> >> >> >When I use 32 partitions, the 4-broker latency becomes larger than the
> >> >> >> >8-broker latency.
> >> >> >> >
> >> >> >> >So is it always true that using more brokers gives lower latency when
> >> >> >> >the number of partitions is at least the number of brokers?
> >> >> >> >
> >> >> >> >Thanks.
> >> >> >> >
> >> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du
> >> >> >> ><yuheng.du.hust@gmail.com>
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> I am running a producer latency test. When using 92 producers on 92
> >> >> >> >> physical nodes publishing to 4 brokers, the latency is slightly lower than
> >> >> >> >> when using 8 brokers. I am using 8 partitions for the topic.
> >> >> >> >>
> >> >> >> >> I have rerun the test and it gives me the same result: the 4-broker
> >> >> >> >> scenario still has lower latency than the 8-broker scenario.
> >> >> >> >>
> >> >> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8 brokers,
> >> >> >> >> 16 brokers and 32 brokers. For the rest of the cases, the latency decreases
> >> >> >> >> as the number of brokers increases.
> >> >> >> >>
> >> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this rule. What
> >> >> >> >> could be the cause?
> >> >> >> >>
> >> >> >> >> I am using a 200-byte message, and the test lets each producer publish 500k
> >> >> >> >> messages to a given topic. Every time I change the number of brokers for a
> >> >> >> >> test run, I use a new topic.
> >> >> >> >>
> >> >> >> >> Thanks for any advice.
> >> >> >> >>
