kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Consumer lag lies - orphaned offsets?
Date Thu, 04 Jun 2015 20:23:09 GMT
Hi,

Here's something potentially useful.

1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the "flat
Lag situation"

2) I restarted the consumer whose lag is shown in the above graph

3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms - NO
lag at all!?

So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
Kind of makes sense - the whole consumer was restarted, consumer/java
process was restarted, everything that was in JMX got reset, and if there
is truly no consumer lag it makes sense that the values in JMX are 0.

HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what Offset
Checker shows for this one consumer:

af_servers      spm_cluster_topic-new-cdh      18  220551962
220586078       34116
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      19  161936440
161960377       23937
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      20  248308642
248340350       31708
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      21  259901355
259934911       33556
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      22  205274547
205296950       22403
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      23  167756654
167780028       23374
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      24  357517989
357574627       56638
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
af_servers      spm_cluster_topic-new-cdh      25  194313232
194338154       24922
af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0

You can see Lag is NOT == 0 for any of the partitions!

And yet, look what JMX is showing now -- all Lag for all partitions is
supposedly 0. Always:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=22
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=22
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=23
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=23
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=23
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=23
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=24
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=24
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=24
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=25
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=25
      Value java.lang.Object = 0
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=25
      Value java.lang.Object = 0

Somebody's lying here. :)

Any ideas?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com
> wrote:

> Hi Jason,
>
> (note: Kafka 0.8.2. GA)
> Got some new info below!  Could be a Kafka metrics bug....
>
> On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <jbr@squareup.com> wrote:
>
>> I assume you are looking at a 'MaxLag' metric, which reports the worst
>> case
>> lag over a set of partitions.
>
>
> No, we're looking at MBeans that look like this one:
>
>
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
>       Value java.lang.Object = 0
>
>
>> Are you consuming multiple partitions, and maybe one of them is stuck?
>>
>
> Don't think so...  Maybe what we are seeing is a Kafka bug.
>
> Here is what we just discovered:
>
> Dumped JMX on the consumer and we see this:
>
>
> kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
>       Value java.lang.Object = 81560
>
> This 81560 is also what we see in SPM - see the flat line here:
> https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> (you can hover over a datapoint on that 81K line to see server name,
> topic, and partition)
>
> This 81560 is just not going down.  If I look at JMX in 5 minutes, it will
> show the same value - the ConsumerLag of 81560!
>
> BUT, this gives different numbers:
>
>  /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh
> kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group
> af_servers | grep spm_cluster_topic-new-cdh
>
> af_servers      spm_cluster_topic-new-cdh      24  *355209634*
> *355249858*       40224
> af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
>
> The delta between the bolded numbers is NOT 81560.  And if I run this
> command N times the delta keeps going down, because the consumer is
> catching up.  Just like you'd expect.
>
> But the JMX number remains constant <== could this be a Kafka metrics/JMX
> bug?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>>
>> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com
>> > wrote:
>>
>> > Hi,
>> >
>> > I've noticed that when we restart our Kafka consumers our consumer lag
>> > metric sometimes looks "weird".
>> >
>> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
>> >
>> > You can see lag go up around 15:00, when some consumers were restarted.
>> > The "weird" thing is that the lag remains flat!
>> > How could it remain flat if consumers are running? (they have enough
>> juice
>> > to catch up!)
>> >
>> > What I think is happening is this:
>> > 1) consumers are initially not really lagging
>> > 2) consumers get stopped
>> > 3) lag grows
>> > 4) consumers get started again
>> > 5) something shifts around...not sure what...
>> > 6) consumers start consuming, and there is actually no lag, but the
>> offsets
>> > written to ZK sometime during 3) don't get updated because after restart
>> > consumers are reading from somewhere else, not from partition(s) whose
>> lag
>> > and offset delta jumped during 3)
>> >
>> > Oh, and:
>> > 7) Kafka JMX still exposes all offsets, event those for partitions that
>> are
>> > no longer being read, so the consumer lag metric remains constant/flat,
>> > even though consumers are actually not lagging on partitions from which
>> > they are now consuming.
>> >
>> > What bugs me is 7), because reading lag info from JMX looks like it's
>> > "lying".
>> >
>> > Does this sound crazy or reasonable?
>> >
>> > If anyone has any comments/advice/suggestions for what one can do about
>> > this, I'm all ears!
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> > Solr & Elasticsearch Support * http://sematext.com/
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message