kafka-users mailing list archives

From Alex Popiel <apop...@marchex.com>
Subject Partition fetching stalls with 0.9.0 new consumer
Date Mon, 25 Apr 2016 23:40:31 GMT
Hello, folks.

I'm encountering a bizarre situation where fetching for specific partitions appears to stall
when using the 0.9.0 new consumer.  I know that no partitions are paused for extended
periods; I issue a resume for all assigned partitions immediately before doing a poll.  Despite
this, I end up with about 7 partitions (the count varies from 3 to 9) for which no records
are delivered to the consumer, even though records continue to be published to those partitions.
As a result, I routinely end up with partition lag in the thousands for this small subset
of partitions, while all other partitions have a lag under twenty.
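
One way to pin down which partitions are affected is to compare two snapshots of consumer
position and log end offset taken some time apart.  The following is only a sketch of that
comparison: StalledPartitionFinder and findStalled are hypothetical names, not part of the
Kafka client, and the four maps (partition number to offset) are assumed to be collected by
whatever monitoring is already in place.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the Kafka client library.
final class StalledPartitionFinder {

    /**
     * Returns the partitions whose log end offset grew between the two samples
     * (records were published) while the consumer position did not advance
     * (no records were delivered). Partitions missing from any map are skipped.
     */
    static Set<Integer> findStalled(Map<Integer, Long> positionBefore,
                                    Map<Integer, Long> positionAfter,
                                    Map<Integer, Long> endBefore,
                                    Map<Integer, Long> endAfter) {
        Set<Integer> stalled = new TreeSet<>();
        for (Map.Entry<Integer, Long> entry : positionAfter.entrySet()) {
            Integer partition = entry.getKey();
            Long posBefore = positionBefore.get(partition);
            Long logEndBefore = endBefore.get(partition);
            Long logEndAfter = endAfter.get(partition);
            if (posBefore == null || logEndBefore == null || logEndAfter == null) {
                continue; // partition was not present in both samples
            }
            boolean published = logEndAfter > logEndBefore;
            boolean consumed = entry.getValue() > posBefore;
            if (published && !consumed) {
                stalled.add(partition);
            }
        }
        return stalled;
    }
}
```

An idle partition (no new records, no movement) is deliberately not reported, so the result
separates "wedged" from "quiet".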

For scale, I have 3 brokers, 100 partitions, and 16 consumer instances.  Records range from
20k to 160k, typically around 30-40k.  Processing time is mostly linear with record size,
on the order of 1 CPU-second per 6k of record data.  Because of the high processing time,
processing is done multi-threaded across 34 cores, and if processing from a single poll hasn't
completed in the heartbeat interval, I pause all assigned partitions, issue a poll(0) to force
the heartbeat, and then resume all assigned partitions.
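
For readers who want the pause / poll(0) / resume sequence spelled out, here is a minimal
sketch of it.  PausableConsumer is a hypothetical stand-in interface so the example compiles
without the Kafka client on the classpath; its method names mirror the consumer calls named
above (assignment, pause, resume, poll), but the signatures are simplified and are not the
real client API.  Re-reading the assignment after the poll is an assumption of this sketch,
made in case a rebalance changes the assignment during the call.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the consumer; signatures are simplified on purpose.
interface PausableConsumer<P> {
    Set<P> assignment();
    void pause(Collection<P> partitions);
    void resume(Collection<P> partitions);
    int poll(long timeoutMs); // number of records delivered by this poll
}

final class HeartbeatKeepAlive {

    /**
     * Pauses every assigned partition, polls with a zero timeout so the
     * consumer can heartbeat without handing back records, then resumes
     * whatever is assigned after the poll. Returns the number of records
     * the poll delivered, which should be 0 while everything is paused.
     */
    static <P> int heartbeatWithoutFetching(PausableConsumer<P> consumer) {
        List<P> toPause = new ArrayList<>(consumer.assignment());
        consumer.pause(toPause);
        int delivered = consumer.poll(0);
        // Assumption: the assignment may differ after the poll, so read it again.
        consumer.resume(new ArrayList<>(consumer.assignment()));
        return delivered;
    }
}
```

A nonzero return value from heartbeatWithoutFetching would indicate that some partition was
not actually paused when the poll ran.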

When partitions get wedged, bouncing one of the consumer instances (not necessarily the instance
that would receive the partitions) will often unwedge the partitions that were wedged... but
then other partitions get wedged instead.

I have more than sufficient CPU to process all the records, and much of the consumer instance
time is spent waiting on a poll(60000) result which doesn't return anything from the partitions
that are wedged.  Also, my brokers seem to be running cold, with less than 30% CPU utilization
and less than 2MB/sec disk i/o.

Has anyone seen anything like this?  Is it normal for the consumer fetcher to be biased in
which partitions it fetches from?  Are there any suggestions on how to diagnose further?

- Alex
