kafka-users mailing list archives

From Otis Gospodnetić <otis.gospodne...@gmail.com>
Subject Re: Consumer lag lies - orphaned offsets?
Date Fri, 05 Jun 2015 03:13:19 GMT
Hi Jason,

On Thu, Jun 4, 2015 at 5:55 PM, Jason Rosenberg <jbr@squareup.com> wrote:

> what code do you use to poll the jmx end-point?


I used https://github.com/sematext/jmxc to get a "dump" of Consumer JMX.
It's a simple jar that you call with java -jar jmxc.jar PID > jmx.dump .
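If you then want to pull just the ConsumerLag values out of such a dump, something like this rough Python sketch works (it assumes the two-line "MBean name / Value ... = N" layout shown in the dumps quoted below; the function name and regexes are mine, not part of jmxc):

```python
import re

# MBean names look like:
# kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=...,topic=...,partition=N
LAG_RE = re.compile(
    r"kafka\.server:type=FetcherLagMetrics,name=ConsumerLag,"
    r"clientId=(?P<client>[^,]+),topic=(?P<topic>[^,]+),partition=(?P<part>\d+)"
)
VALUE_RE = re.compile(r"Value\s+\S+\s*=\s*(-?\d+)")

def parse_consumer_lag(dump):
    """Map (clientId, topic, partition) -> lag, from a jmxc-style dump."""
    lags = {}
    lines = dump.splitlines()
    for i, line in enumerate(lines):
        m = LAG_RE.search(line)
        if m and i + 1 < len(lines):
            # The attribute value is on the indented line right after the name.
            v = VALUE_RE.search(lines[i + 1])
            if v:
                key = (m.group("client"), m.group("topic"), int(m.group("part")))
                lags[key] = int(v.group(1))
    return lags

dump = """\
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
      Value java.lang.Object = 81560
"""
print(parse_consumer_lag(dump))
# -> {('af_servers', 'spm_cluster_topic-new-cdh', 24): 81560}
```

Diffing two such parses taken a few minutes apart is an easy way to spot lag values that never move.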

Otis
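
P.S. For anyone cross-checking the two sources: the Lag column that ConsumerOffsetChecker prints should always be log-end offset (logSize) minus the committed offset, so it is easy to verify against the two offset columns. A trivial sketch, using numbers from the offset-checker output quoted below (column meanings assumed from the 0.8.x tool's output):

```python
def lag(committed, log_end):
    """Consumer lag the way ConsumerOffsetChecker reports it: logSize - committed offset."""
    return log_end - committed

# Figures taken from the ConsumerOffsetChecker output in this thread:
assert lag(355209634, 355249858) == 40224   # spm_cluster_topic-new-cdh, partition 24
assert lag(220551962, 220586078) == 34116   # partition 18
print("Lag column checks out")
```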


> On Thu, Jun 4, 2015 at 4:55 PM, Otis Gospodnetić <otis.gospodnetic@gmail.com> wrote:
>
> > Hi,
> >
> > On Thu, Jun 4, 2015 at 4:26 PM, Scott Reynolds <sreynolds@twilio.com>
> > wrote:
> >
> > > I believe the JMX metrics reflect the consumer PRIOR to committing
> > > offsets to Kafka / Zookeeper. But when you query from the command line
> > > using the kafka tools, you are just getting the committed offsets.
> > >
> >
> > Even if that were the case, and maybe it is, it doesn't explain why the
> > ConsumerLag in JMX often remains *completely constant*.....forever...
> > until the consumer is restarted.  You see what I mean?
> >
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > > On Thu, Jun 4, 2015 at 1:23 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Here's something potentially useful.
> > > >
> > > > 1) Before: https://apps.sematext.com/spm-reports/s/eQ9WhLegW9 - the
> > > > "flat Lag situation"
> > > >
> > > > 2) I restarted the consumer whose lag is shown in the above graph
> > > >
> > > > 3) After restart: https://apps.sematext.com/spm-reports/s/4YGkcUP9ms -
> > > > NO lag at all!?
> > > >
> > > > So that 81560 Lag value that was stuck in JMX is gone.  Went down to 0.
> > > > Kind of makes sense - the whole consumer/java process was restarted,
> > > > everything that was in JMX got reset, and if there is truly no consumer
> > > > lag it makes sense that the values in JMX are 0.
> > > >
> > > > HOWEVER, is the Lag *really* always *exactly* 0?  No way.  Look what
> > > > Offset Checker shows for this one consumer:
> > > >
> > > > af_servers  spm_cluster_topic-new-cdh  18  220551962  220586078  34116  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  19  161936440  161960377  23937  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  20  248308642  248340350  31708  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  21  259901355  259934911  33556  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  22  205274547  205296950  22403  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  23  167756654  167780028  23374  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  24  357517989  357574627  56638  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > > af_servers  spm_cluster_topic-new-cdh  25  194313232  194338154  24922  af_servers_spm-afs-6.prod.sematext-1433447997963-c40182c8-0
> > > >
> > > > You can see Lag is NOT == 0 for any of the partitions!
> > > >
> > > > And yet, look what JMX is showing now -- all Lag for all partitions is
> > > > supposedly 0. Always:
> > > >
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=17
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=18
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=19
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=20
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=21
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=22
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=23
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-2-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-3-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=25
> > > >       Value java.lang.Object = 0
> > > >
> > > > Somebody's lying here. :)
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > > >
> > > >
> > > > On Thu, Jun 4, 2015 at 3:54 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > (note: Kafka 0.8.2. GA)
> > > > > Got some new info below!  Could be a Kafka metrics bug....
> > > > >
> > > > > On Thu, Jun 4, 2015 at 2:11 PM, Jason Rosenberg <jbr@squareup.com> wrote:
> > > > >
> > > > >> I assume you are looking at a 'MaxLag' metric, which reports the
> > > > >> worst case lag over a set of partitions.
> > > > >
> > > > >
> > > > > No, we're looking at MBeans that look like this one:
> > > > >
> > > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_free_system_topic-new-cdh,partition=10
> > > > >       Value java.lang.Object = 0
> > > > >
> > > > >
> > > > >> Are you consuming multiple partitions, and maybe one of them is stuck?
> > > > >>
> > > > >
> > > > > Don't think so...  Maybe what we are seeing is a Kafka bug.
> > > > >
> > > > > Here is what we just discovered:
> > > > >
> > > > > Dumped JMX on the consumer and we see this:
> > > > >
> > > > >
> > > > > kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=af_servers,topic=spm_cluster_topic-new-cdh,partition=24
> > > > >       Value java.lang.Object = 81560
> > > > >
> > > > > This 81560 is also what we see in SPM - see the flat line here:
> > > > > https://apps.sematext.com/spm-reports/s/eQ9WhLegW9
> > > > > (you can hover over a datapoint on that 81K line to see server name,
> > > > > topic, and partition)
> > > > >
> > > > > This 81560 is just not going down.  If I look at JMX in 5 minutes, it
> > > > > will show the same value - the ConsumerLag of 81560!
> > > > >
> > > > > BUT, this gives different numbers:
> > > > >
> > > > > /usr/lib/kafka_2.8.0-0.8.1.1/bin/kafka-run-class.sh
> > > > > kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181
> > > > > --group af_servers | grep spm_cluster_topic-new-cdh
> > > > >
> > > > > af_servers  spm_cluster_topic-new-cdh  24  *355209634*  *355249858*  40224  af_servers_spm-afs-6.prod.sematext-1433430424202-e366dfdf-0
> > > > >
> > > > > The delta between the bolded numbers is NOT 81560.  And if I run this
> > > > > command N times the delta keeps going down, because the consumer is
> > > > > catching up.  Just like you'd expect.
> > > > >
> > > > > But the JMX number remains constant <== could this be a Kafka
> > > > > metrics/JMX bug?
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > --
> > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > >>
> > > > >> On Tue, Jun 2, 2015 at 4:00 PM, Otis Gospodnetic <otis.gospodnetic@gmail.com> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > I've noticed that when we restart our Kafka consumers our consumer
> > > > >> > lag metric sometimes looks "weird".
> > > > >> >
> > > > >> > Here's an example: https://apps.sematext.com/spm-reports/s/0Hq5zNb4hH
> > > > >> >
> > > > >> > You can see lag go up around 15:00, when some consumers were
> > > > >> > restarted.  The "weird" thing is that the lag remains flat!
> > > > >> > How could it remain flat if consumers are running? (they have
> > > > >> > enough juice to catch up!)
> > > > >> >
> > > > >> > What I think is happening is this:
> > > > >> > 1) consumers are initially not really lagging
> > > > >> > 2) consumers get stopped
> > > > >> > 3) lag grows
> > > > >> > 4) consumers get started again
> > > > >> > 5) something shifts around...not sure what...
> > > > >> > 6) consumers start consuming, and there is actually no lag, but the
> > > > >> > offsets written to ZK sometime during 3) don't get updated because
> > > > >> > after restart consumers are reading from somewhere else, not from
> > > > >> > partition(s) whose lag and offset delta jumped during 3)
> > > > >> >
> > > > >> > Oh, and:
> > > > >> > 7) Kafka JMX still exposes all offsets, even those for partitions
> > > > >> > that are no longer being read, so the consumer lag metric remains
> > > > >> > constant/flat, even though consumers are actually not lagging on
> > > > >> > partitions from which they are now consuming.
> > > > >> >
> > > > >> > What bugs me is 7), because reading lag info from JMX looks like
> > > > >> > it's "lying".
> > > > >> >
> > > > >> > Does this sound crazy or reasonable?
> > > > >> >
> > > > >> > If anyone has any comments/advice/suggestions for what one can do
> > > > >> > about this, I'm all ears!
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Otis
> > > > >> > --
> > > > >> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > >> > Solr & Elasticsearch Support * http://sematext.com/
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
