storm-user mailing list archives

From "Nikos R. Katsipoulakis" <nick.kat...@gmail.com>
Subject Re: Complete Latency Vs. Throughput--when do they not change in same direction?
Date Fri, 01 Apr 2016 16:34:02 GMT
Hello again John,

No need to apologize. All experiments in distributed environments involve so
many details that it is only natural to forget some of them in an initial
explanation.

Going back to my point about seeing lower throughput along with lower
latency, here is an example:

Assume you own a system that has the resources to handle up to 100
events/sec and guarantees a mean latency of 5 msec. If you run an experiment
in which you send in 100 events/sec, your system runs at 100% capacity and
you observe a 5 msec end-to-end latency. Your throughput is expected to be
somewhere close to 100 events/sec (slightly lower once you factor in
latency). Now, if you run another experiment in which you send in 50
events/sec, your system runs at 50% capacity and you observe an average
end-to-end latency of somewhere around 2.5 msec. In the second experiment,
you should expect to see lower throughput than in the first, somewhere
around 50 events/sec.

Of course, the example above assumes a 1:1 mapping between each input data
point and each output. If that is not the case, then more details are
needed.
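
To make the arithmetic concrete, here is a quick sketch (plain Java, nothing
Storm-specific; the linear scaling of latency with utilization is just the
simplifying assumption of the example, not a model of a real system):

public class ThroughputVsLatencySketch {
    public static void main(String[] args) {
        double capacityPerSec = 100.0;    // events/sec the system can handle
        double latencyAtFullLoadMs = 5.0; // mean latency observed at 100% utilization

        for (double offeredRate : new double[]{100.0, 50.0}) {
            double utilization = Math.min(offeredRate / capacityPerSec, 1.0);
            // Throughput can exceed neither the offered rate nor the capacity.
            double throughput = Math.min(offeredRate, capacityPerSec);
            // Simplifying assumption from the example: latency scales linearly with load.
            double latencyMs = latencyAtFullLoadMs * utilization;
            System.out.printf("offered %.0f ev/s -> throughput ~%.0f ev/s, latency ~%.1f ms%n",
                    offeredRate, throughput, latencyMs);
        }
    }
}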

Thanks,
Nikos

On Fri, Apr 1, 2016 at 12:20 PM, John Yost <hokiegeek2@gmail.com> wrote:

> Hey Nikos,
>
> Thanks for responding so quickly, and I apologize for leaving out a
> crucially important detail--the Kafka topic. My topo is reading from a
> static topic. I definitely agree that reading from a live topic could--and
> likely would--lead to variable throughput rates, both in terms of the raw
> input rate and variability in the content. Again, great questions and
> points; I should have specified in my original post that my topo is reading
> from a static Kafka topic.
>
> Regarding your third point, my thinking is that throughput would go up if
> Complete Latency went down, since it's my understanding that Complete Latency
> measures the average amount of time each tuple spends in the topology. The
> key word here is "if"--if the input rate stays the same. If Complete Latency
> decreases, more tuples can be processed by the topology in a given amount of
> time. But I see what you're saying: the average time spent on each tuple would
> be greater if the input rate goes up, because there's more data per second,
> more context switching amongst the executors, etc. Please correct me if I am
> thinking about this the wrong way, because this seems to be a pretty
> fundamental fact about Storm that I need to have right.
>
> Great point regarding waiting for the topology to complete its warm-up. I let
> my topo run for 20 minutes before measuring anything.
>
> Thanks
>
> --John
>
>
>
> On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <
> nick.katsip@gmail.com> wrote:
>
>> Hello John,
>>
>> I have to say that a system's telemetry is not easily understood. That
>> said, let us try to deduce what in your use-case might be causing the
>> inconsistent performance metrics.
>>
>> First, I would like to ask whether your KafkaSpouts produce tuples at the
>> same rate. In other words, do you produce or read data in a deterministic
>> (replayable) way, or do you attach your KafkaSpout to a non-controllable
>> source of data (like a Twitter feed, news feed, etc.)? The reason I am
>> asking is that figuring out what happens at the source of your data (in
>> terms of input rate) is really important. If your use-case involves a
>> varying input rate for your sources, I would suggest picking a particular
>> snapshot of that source and replaying your experiments, in order to check
>> whether the variance in latency/throughput still exists.
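>>
>> For example, something along these lines pins the spout to the same snapshot
>> on every run. This is only a rough sketch against the storm-kafka SpoutConfig
>> API; the ZooKeeper address and topic name are placeholders, and the exact
>> field names vary a little across versions:
>>
>> // imports from storm.kafka.* and backtype.storm.spout.SchemeAsMultiScheme assumed
>> BrokerHosts hosts = new ZkHosts("zk1:2181");  // placeholder ZooKeeper quorum
>> SpoutConfig spoutConf =
>>         new SpoutConfig(hosts, "my-static-topic", "/kafka-spout", "replay-test");
>> spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
>> spoutConf.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // always replay from the beginning
>> spoutConf.ignoreZkOffsets = true; // older releases call this forceFromStart
>> KafkaSpout kafkaSpout = new KafkaSpout(spoutConf);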
>>
>> The second point I would like to make is that throughput (or ack-rate, as
>> you correctly put it) can sometimes depend on the data you are pushing. For
>> instance, a computation-heavy task might take more time for one value
>> distribution than for another. Therefore, please make sure that the data
>> you send into the system always cause the same amount of computation.
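>>
>> As a toy illustration (a hypothetical bolt, not anything from your topology),
>> the per-tuple cost below grows with the value of an input field, so a
>> different value distribution changes both throughput and latency even at the
>> same input rate:
>>
>> // Storm imports (BaseBasicBolt, Tuple, BasicOutputCollector, OutputFieldsDeclarer,
>> // Fields, Values) omitted; the package is backtype.storm or org.apache.storm
>> // depending on your version.
>> public class SkewSensitiveBolt extends BaseBasicBolt {
>>     @Override
>>     public void execute(Tuple tuple, BasicOutputCollector collector) {
>>         int n = tuple.getIntegerByField("n"); // assumes an integer field named "n"
>>         long acc = 0;
>>         for (long i = 0; i < (long) n * 1000; i++) { // work proportional to the field value
>>             acc += i;
>>         }
>>         collector.emit(new Values(acc));
>>     }
>>
>>     @Override
>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>         declarer.declare(new Fields("acc"));
>>     }
>> }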
>>
>> Third, seeing throughput and latency drop at the same time immediately
>> points to a drop in the input rate. Think about it: if I send in tuples at
>> a lower input rate, I expect throughput to drop (since fewer tuples enter
>> the system per second), and at the same time the heavy computation has to
>> work with less data (so end-to-end latency also drops). Does that make
>> sense to you? Can you verify that you had consistent input rates across the
>> different runs?
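>>
>> One easy way to verify this is to have the spout (or a thin wrapper around
>> it) count emitted tuples and log the rate once a minute. A rough,
>> framework-agnostic sketch (the class and method names here are just
>> placeholders):
>>
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.TimeUnit;
>> import java.util.concurrent.atomic.AtomicLong;
>>
>> public class EmitRateLogger {
>>     private final AtomicLong emitted = new AtomicLong();
>>
>>     public EmitRateLogger() {
>>         // Print and reset the per-minute emit count so runs can be compared.
>>         Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
>>                 () -> System.out.println("tuples emitted last minute: " + emitted.getAndSet(0)),
>>                 1, 1, TimeUnit.MINUTES);
>>     }
>>
>>     public void onEmit() {
>>         emitted.incrementAndGet(); // call once per tuple emitted from nextTuple()
>>     }
>> }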
>>
>> Finally, I would suggest that you let Storm warm up and drop your initial
>> metrics. In my experience with Storm, latency and throughput at the
>> beginning of a run (until all buffers fill up) are highly variable and
>> therefore not reliable data points to include in your analysis. You can
>> verify this by plotting your data over time.
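>>
>> For dropping the warm-up samples, a helper along these lines is enough
>> (purely illustrative, not tied to any particular metrics API; timestamps are
>> in milliseconds):
>>
>> import java.util.Map;
>>
>> static double steadyStateAverage(Map<Long, Double> samplesByTimestampMs,
>>                                  long topologyStartMs, long warmupMs) {
>>     // Keep only samples taken after the warm-up window, so the noisy
>>     // start-up period does not skew the average.
>>     return samplesByTimestampMs.entrySet().stream()
>>             .filter(e -> e.getKey() >= topologyStartMs + warmupMs)
>>             .mapToDouble(Map.Entry::getValue)
>>             .average()
>>             .orElse(Double.NaN);
>> }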
>>
>> Thanks,
>> Nikos
>>
>> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <hokiegeek2@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am a little puzzled by what I am seeing in some testing with a
>>> topology I have, where the topo reads from a KafkaSpout, does some
>>> CPU-intensive processing, and then writes out to Kafka via the standard
>>> KafkaBolt.
>>>
>>> I am doing my testing in a multi-tenant environment, so test results can
>>> vary by 10-20% on average. However, results have been much more variable
>>> over the last couple of days.
>>>
>>> The big thing I am noticing: whereas today's throughput--as measured in
>>> tuples acked/minute--is half of what it was yesterday for the same
>>> configuration, the Complete Latency (the total time a tuple is in the
>>> topology, from the time it hits the KafkaSpout to the time it is acked in
>>> the KafkaBolt) is a third of what it was yesterday.
>>>
>>> Any ideas as to how the throughput could go down dramatically at the
>>> same time the Complete Latency is improving?
>>>
>>> Thanks
>>>
>>> --John
>>>
>>
>>
>>
>> --
>> Nikos R. Katsipoulakis,
>> Department of Computer Science
>> University of Pittsburgh
>>
>
>


-- 
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh
