kafka-users mailing list archives

From Alex Popiel <apop...@marchex.com>
Subject RE: Partition fetching stalls with 0.9.0 new consumer
Date Thu, 28 Apr 2016 20:45:58 GMT
Hello, Robert.

I upgraded to 0.9.0.1 and, after letting it bake for a day and a half, can confirm that the
issue is now resolved.  KAFKA-2978 was likely the culprit.

Thanks,
- Alex

-----Original Message-----
From: Underwood, Robert [mailto:Robert.Underwood@inin.com] 
Sent: Tuesday, April 26, 2016 2:51 PM
To: users@kafka.apache.org
Subject: Re: Partition fetching stalls with 0.9.0 new consumer

You may be hitting https://issues.apache.org/jira/browse/KAFKA-2978 if you're using 0.9.0.0.

________________________________________
From: Alex Popiel <apopiel@marchex.com>
Sent: Monday, April 25, 2016 7:40 PM
To: 'users@kafka.apache.org'
Subject: Partition fetching stalls with 0.9.0 new consumer

Hello, folks.

I'm encountering a bizarre situation where fetching for specific partitions appears to stall
when using the 0.9.0 new consumer.  I know that no partitions are paused for extended
periods; I issue a resume for all assigned partitions immediately before doing a poll.  Despite
this, I end up with roughly 7 partitions (the count varies from 3 to 9) where no records
are delivered to the consumer, even though records continue to be published to those partitions.
As a result, I routinely end up with partition lag in the thousands for this small subset
of partitions, while all other partitions have a lag under twenty.
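
The symptom above (the log keeps growing while the consumer's position stays put) can be
checked mechanically by comparing two offset snapshots taken some time apart. The following is
only a minimal sketch: the function name, the snapshot layout, and the sample numbers are all
hypothetical, and collecting the offsets from the brokers and the consumer is left out.

```python
def wedged_partitions(before, after):
    """Return partitions whose log grew while the consumed offset did not move.

    `before` and `after` map partition id -> (log_end_offset, consumed_offset).
    This layout is an assumption made for illustration only.
    """
    wedged = []
    for partition, (end_after, consumed_after) in after.items():
        if partition not in before:
            continue  # newly seen partition, nothing to compare against
        end_before, consumed_before = before[partition]
        # New records were published, but the consumer made no progress.
        if end_after > end_before and consumed_after == consumed_before:
            wedged.append(partition)
    return sorted(wedged)


# Hypothetical snapshots: partition 1 receives records but is never consumed.
before = {0: (1000, 990), 1: (2000, 1500), 2: (3000, 2995)}
after = {0: (1100, 1090), 1: (2400, 1500), 2: (3050, 3040)}
stalled = wedged_partitions(before, after)
```

A partition that is merely slow still advances its consumed offset between snapshots, so it is
not reported; only partitions with no progress at all are.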

For scale, I have 3 brokers, 100 partitions, and 16 consumer instances.  Records range from
20k to 160k, typically around 30-40k.  Processing time is mostly linear with record size,
on the order of 1 CPU-second per 6k of record data.  Because of the high processing time,
processing is multi-threaded across 34 cores, and if processing from a single poll hasn't
completed within the heartbeat interval, I pause all assigned partitions, issue a poll(0) to
force the heartbeat, and then resume all assigned partitions.
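
The pause / poll(0) / resume sequence described above can be sketched as follows. This is an
illustration under assumptions, not the code from this deployment: `process_with_keepalive`,
`work_done`, and `FakeConsumer` are hypothetical names, and the consumer is any object exposing
assignment(), pause(), poll(), and resume() calls shaped like the ones mentioned in the message.

```python
import time


def process_with_keepalive(consumer, work_done, heartbeat_interval_s, sleep_s=0.01):
    """Wait for in-flight processing, issuing keep-alive polls while it runs.

    Returns the number of keep-alive polls that were issued.
    """
    keepalive_polls = 0
    deadline = time.monotonic() + heartbeat_interval_s
    while not work_done():
        if time.monotonic() >= deadline:
            partitions = list(consumer.assignment())
            consumer.pause(*partitions)   # ensure poll(0) cannot hand back records
            consumer.poll(0)              # drives the heartbeat only
            consumer.resume(*partitions)  # fetch normally on the next real poll
            keepalive_polls += 1
            deadline = time.monotonic() + heartbeat_interval_s
        time.sleep(sleep_s)
    return keepalive_polls


class FakeConsumer:
    """Hypothetical stand-in that records calls; it is not a real Kafka client."""

    def __init__(self, partitions):
        self._assigned = set(partitions)
        self.paused = set()
        self.calls = []

    def assignment(self):
        return set(self._assigned)

    def pause(self, *partitions):
        self.paused.update(partitions)
        self.calls.append("pause")

    def resume(self, *partitions):
        self.paused.difference_update(partitions)
        self.calls.append("resume")

    def poll(self, timeout_ms):
        # The keep-alive poll is only safe while every partition is paused.
        assert self.paused == self._assigned
        self.calls.append("poll")
        return {}


# Simulated work that reports "not finished" three times, then completes.
checks = iter([False, False, False, True])
consumer = FakeConsumer(partitions=[0, 1, 2])
polls = process_with_keepalive(
    consumer, lambda: next(checks), heartbeat_interval_s=0, sleep_s=0
)
```

With a heartbeat interval of 0 the deadline has always passed, so each unfinished check triggers
one keep-alive cycle; a real interval would space the cycles out instead.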

When partitions get wedged, bouncing one of the consumer instances (not necessarily the instance
that would receive the partitions) will often unwedge the partitions that were wedged... but
then other partitions get wedged instead.

I have more than sufficient CPU to process all the records, and much of the consumer instance
time is spent waiting on a poll(60000) result which doesn't return anything from the partitions
that are wedged.  Also, my brokers seem to be running cold, with less than 30% CPU utilization
and less than 2MB/sec disk I/O.

Has anyone seen anything like this?  Is it normal for the consumer fetcher to be biased in
which partitions it fetches from?  Are there any suggestions on how to diagnose further?

- Alex
