kafka-users mailing list archives

From James Olsen <ja...@inaseq.com>
Subject Problems when Consuming from multiple Partitions
Date Thu, 05 Mar 2020 03:42:41 GMT
I’m seeing behaviour that I don’t understand when I have Consumers fetching from multiple
Partitions from the same Topic.  There are two different conditions arising:

1. A subset of the Partitions allocated to a given Consumer is not consumed at all.  The
Consumer appears healthy, the Thread is running and logging activity and is successfully processing
records from some of the Partitions it has been assigned.  I don’t think this is due to
the first Partition fetched filling a Batch (KIP-387).  The problem does not occur if we have
a particular number of Consumers (3 in this case) but it has failed with a range of other
larger values.  I don’t think there is anything special about 3 - it just happens to work
OK with that value although it is the same as the Broker and Replica count.  When we tried
6 Consumers, 5 were fine but 1 exhibited this issue.
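
To narrow down condition 1, one check would be to compare each assigned Partition's current position with its log end offset and flag any Partition that has a backlog but has delivered nothing recently. The following is only a sketch in plain Java over maps so that it stands alone; the class and method names are hypothetical, and in a real client the inputs would presumably be gathered from KafkaConsumer.assignment(), position() and endOffsets().

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical diagnostic helper; not part of the Kafka client. */
public final class StalledPartitions {
    private StalledPartitions() {}

    /**
     * Returns the assigned partitions that still have unread records
     * (end offset greater than position) but delivered no records
     * during the recent observation window.
     */
    public static Set<Integer> find(Set<Integer> assigned,
                                    Map<Integer, Long> positions,
                                    Map<Integer, Long> endOffsets,
                                    Set<Integer> recentlyDelivered) {
        Set<Integer> stalled = new TreeSet<>();
        for (Integer partition : assigned) {
            // Missing entries are treated as offset 0 (an assumption of this sketch).
            long position = positions.getOrDefault(partition, 0L);
            long end = endOffsets.getOrDefault(partition, 0L);
            boolean hasBacklog = end > position;
            if (hasBacklog && !recentlyDelivered.contains(partition)) {
                stalled.add(partition);
            }
        }
        return stalled;
    }
}
```

A Partition that shows up here on several consecutive checks would be a candidate for the behaviour described above, whereas an empty Partition never appears because it has no backlog.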

2. A delay of up to half a second between the Producer sending and the Consumer receiving a message.  This
looks suspiciously like fetch.max.wait.ms=500, but we also have fetch.min.bytes=1 so we should
get messages as soon as something is available.  The only explanation I can think of is if
the fetch.max.wait.ms is applied in full to the first Partition checked and it remains empty
for the duration.  Then it moves on to a subsequent non-empty Partition and delivers messages
from there.
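
One way to test that hypothesis would be to lower fetch.max.wait.ms and see whether the worst-case delay shrinks with it. The sketch below only builds the two relevant Consumer settings; the class name is hypothetical and any value passed in (for example 100) is an arbitrary test value, not a recommendation.

```java
import java.util.Properties;

/** Hypothetical helper that builds the fetch settings under test. */
public final class FetchLatencyProbe {
    private FetchLatencyProbe() {}

    /** Returns consumer overrides with fetch.min.bytes=1 and the given max wait. */
    public static Properties consumerOverrides(int maxWaitMs) {
        Properties props = new Properties();
        // Same as our current setting: respond as soon as any data is available.
        props.setProperty("fetch.min.bytes", "1");
        // The value being varied; we currently run with 500.
        props.setProperty("fetch.max.wait.ms", Integer.toString(maxWaitMs));
        return props;
    }
}
```

If the maximum observed delay tracks the configured value, that would point at the wait being applied to an empty Partition; if it stays near half a second, the cause is probably elsewhere.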

Our environment is AWS MSK (Kafka 2.2.1) and Kafka Java client 2.4.0.

All environments appear healthy and under light load, e.g. clients operating at only 1-2%
CPU, Brokers (3) at 5-10% CPU.  No swap, no crashes, no dead threads etc.

Typical scenario is a Topic with 60 Partitions, 3 Replicas and a single ConsumerGroup with
5 Consumers.  The Partitioning is for semantic purposes with the intention being to add more
Consumers as the business grows and load increases.  Some of the Partitions are always empty
due to using short string keys and the default Partitioner - we will probably implement a
custom Partitioner to achieve better distribution in the near future.
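
As a rough illustration of why some Partitions stay empty: if k distinct keys were hashed uniformly at random across n Partitions, the expected number of empty Partitions would be n * (1 - 1/n)^k. The sketch below just evaluates that formula; uniform hashing is an assumption here, as is whatever key count is passed in, and the class name is hypothetical.

```java
/** Hypothetical estimator; assumes keys hash uniformly across partitions. */
public final class EmptyPartitionEstimate {
    private EmptyPartitionEstimate() {}

    /**
     * Expected number of partitions that receive no key at all when
     * distinctKeys keys are spread uniformly over the given partitions.
     */
    public static double expectedEmpty(int partitions, int distinctKeys) {
        // Each partition is missed by one key with probability (1 - 1/partitions).
        double missProbability = 1.0 - 1.0 / partitions;
        return partitions * Math.pow(missProbability, distinctKeys);
    }
}
```

Calling expectedEmpty(60, k) for the actual number of distinct keys gives a baseline to compare against the number of Partitions observed to be empty; a much larger observed count would suggest the short keys are clustering more than a uniform hash would.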

I don’t have access to the detailed JMX metrics yet but am working on that in the hope it
will help diagnose.

Thoughts and advice appreciated!