samza-dev mailing list archives

From Jordan Shaw <>
Subject container concurrency and pipelining
Date Fri, 06 Feb 2015 08:00:53 GMT
Hi everyone,
I've done some raw disk, Kafka and Samza benchmarking. A single Samza
container's consumer peaked at around 2MB/s. Running the Kafka consumer
perf test on the same machine, though, I can do 100s of MB/s. It seems like
most of the bottleneck is in the Kafka async client. There appears to
be only one thread in the Kafka client rather than a thread pool, and because
a container can't run on multiple cores, I assume this thread gets
scheduled on the same core as the consumer and the process call.

I know a lot of thought has been put into the design of maintaining parity
between task instances and partitions and preventing unpredictable behavior
from a threaded system. A reasonable solution might be to just add
partitions and increase the container count along with the partition count.
This comes at the cost of higher memory usage on the node managers because
of the increased container count.

Have there been any design discussions about allowing multiple cores on a
single container, to allow better pipelining within the container and get
better throughput, and also about introducing a thread pool outside of
Kafka's client to allow concurrent produces to Kafka within the same
container? I understand there are ordering concerns with this concurrency.
For order-sensitive use cases the thread pool size could be 1, but use cases
where ordering is less important and raw throughput is more of a concern
could get that throughput by allowing concurrent async produces. I also know
that Kafka has plans to rework their producer, but I haven't been able to
find out whether this includes introducing a thread pool to allow multiple
async produces. Lastly, has anyone been able to get more MB/s out of a
container than I have? Thanks!
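
To make the thread pool idea a bit more concrete, here is a rough sketch of
what such a wrapper might look like. It is only an illustration: the
PooledProducer class and the Sender interface are hypothetical names, not
anything that exists in Samza or Kafka, and Sender just stands in for
whatever blocking send call the real producer would expose.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper that hands produce calls to a fixed-size thread pool. */
final class PooledProducer {

    /** Hypothetical stand-in for a blocking send to Kafka (an assumption, not a real API). */
    interface Sender {
        void send(String message);
    }

    private final ExecutorService pool;
    private final Sender sender;

    /**
     * poolSize == 1 keeps sends in submission order;
     * poolSize > 1 lets sends overlap, so ordering is no longer guaranteed.
     */
    PooledProducer(int poolSize, Sender sender) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        this.sender = sender;
    }

    /** Queues the message and returns immediately; a pool thread performs the send. */
    void produce(String message) {
        pool.submit(() -> sender.send(message));
    }

    /** Stops accepting new messages and waits for queued sends; returns false on timeout or interrupt. */
    boolean close(long timeoutMs) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a pool size of 1 the single worker thread drains the queue in FIFO
order, so messages go out in the order produce() was called. With a larger
pool the sends run concurrently and can complete out of order, which is the
trade-off I was describing for throughput-oriented jobs.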

