kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Santilli <jonathansanti...@gmail.com>
Subject Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
Date Tue, 11 Jun 2019 12:13:47 GMT
Hello,

we currently see the same WARN logs when the App is processing low rate of
records, as soon the app starts processing a high volume of records,
those WARN stop showing in the logs.
According to other email threads, this should be fixed with
https://issues.apache.org/jira/browse/KAFKA-7190.

We use version 2.2.1, with EOS

Cheers!
--
Jonathan


On Tue, Jun 11, 2019 at 11:04 AM Pieter Hameete <pieter.hameete@blockbax.com>
wrote:

> Hi Guozhang,
>
> Thanks that clarifies a lot. The segment size or segment ms is not the
> issue here then, changing these properties does not make the errors go away
> either 😊
>
> Regarding the proposed solutions:
>
> 1) We use EOS, so the commit interval is defaulted to 100 ms in our case.
> We don't want to delay consumers that respect transactions in their
> consumption of the produced events so increasing this config is not a good
> option for us. We have tried temporarily increasing it to 30s and indeed
> partitions receiving records at a higher frequency (i.e. 1 per 5 seconds)
> stopped resulting in errors. Partitions receiving records at a lower (i.e.
> 1 per minute) frequency were still resulting in errors.
>
> 2) We have a small disparity in keys, so the data is already spread over a
> small (3 or 4) number of partitions. I don't think reducing the number of
> partitions to for example 4 will make a difference in this case?
>
> My attempt to summarize the issue here would be the following:
>
> If exactly once is enabled and records are produced on a partition less
> frequent than the commit interval, then the brokers/producers will produce
> an UNKNOWN_PRODUCER_ID error/warning on each produced record and the
> producers need to retry once for every produced record. This happens
> because commits trigger truncation of the producer ID map causing producers
> that did not produce any messages since last commit to have their ID
> removed.
>
> In our case we use a Kafka Streams application with exactly once enabled
> and a default commit.interval.ms of 100ms. We don't wish to increase this
> default to allow the consumers to be as real-time as possible. Even if we
> would scale down to 1 partition (which is undesirable) I don't think the
> producers would produce a record to this one partition between every offset
> commit. Basically we can't resolve this problem unless we would have a
> volume large enough to produce a record on each partition between every
> offset commit.
>
> Our only options seem to be to move from exactly once to at least once, or
> to wait for
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer.
>
>
> Best,
>
> Pieter
>
> -----Oorspronkelijk bericht-----
> Van: Guozhang Wang <wangguoz@gmail.com>
> Verzonden: Monday, 10 June 2019 22:19
> Aan: users@kafka.apache.org
> Onderwerp: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams
> applications
>
> Hi Pieter,
>
> My reasoning is that the most recent segment (called the `active`
> segment`) would not be deleted for immediately since it is still being
> appended. I.e.
> say you have two segments offset range at [0, 100), [100, 180). And if the
> delete records API is issued to delete any records older than 150, then the
> first segment would be deleted, but the second (and the active one) would
> not be deleted. Instead we would only set the log-starting-offset at 150.
>
> But when I double checked the code again, I realized that even if we do
> not delete the segment, we would still "truncate" the maintained producer
> ID map, and hence what you've seen is correct: although the data cleaner
> did not delete the segment file yet, corresponding producer ID would still
> be removed.
>
> So for now what I can think of to reduce this issue, are that you can
> either consider 1) increase your commit interval, and hence reduce the
> delete-records frequency, or 2) decrease your num.partitions so that the
> per-partition traffic from a single producer can be reasonably large.
>
>
> Guozhang
>
>
> On Fri, Jun 7, 2019 at 2:09 AM Pieter Hameete <pieter.hameete@blockbax.com
> >
> wrote:
>
> > Hi Guozhang,
> >
> > I've checked the log-cleaner.log on the brokers for cleanup of
> > repartition topics for the event-rule-engine application, but I found
> > no reports of frequent cleanup there.
> >
> > Also forgive me if I am wrong, because I don’t understand the exact
> > mechanics of the log cleanup processes. But didn't KAFKA-6150 (
> > https://github.com/apache/kafka/pull/4270/) introduce a log cleanup
> > mechanic for repartition topics that cleans up the logs more
> > aggressively (i.e. not following the regular LogCleaner cleanup
> > process that is triggered by segment.ms and segment.bytes) by deleting
> > records via the admin client based on committed input topic offsets.
> > This would explain producer id's not being known if the logs get
> > cleaned up almost instantly via this process.
> >
> > Best,
> >
> > Pieter
> >
> > -----Oorspronkelijk bericht-----
> > Van: Guozhang Wang <wangguoz@gmail.com>
> > Verzonden: Thursday, 6 June 2019 18:16
> > Aan: users@kafka.apache.org
> > Onderwerp: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams
> > applications
> >
> > Pieter,
> >
> > KIP-360 should be able to fix it, yes, but it won't be completed soon,
> > and the earliest release it may get in is in 2.4. At the mean time, I
> > thought that be increasing the segment size, hence reducing the
> > frequency that records being truncated and hence causing producer ID
> > to be removed to be much smaller.
> >
> > If you look at the broker-side logs, could you see if the log cleanup
> > threads logs that truncating the
> > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-xx
> > partitions? If it is too frequent that would cause the producer ID
> > stored on brokers to be removed too early and therefore
> > UNKNOWN_PRODUCER_ID, but overriding the segment size / ms should
> > prevent it from being triggered too often.
> >
> > Guozhang
> >
> >
> > On Thu, Jun 6, 2019 at 5:10 AM Pieter Hameete
> > <pieter.hameete@blockbax.com
> > >
> > wrote:
> >
> > > Hello,
> > >
> > > We also found this earlier email in the archives which looks very
> > > much like what we are experiencing:
> > >
> > >
> > > http://mail-archives.apache.org/mod_mbox/kafka-users/201811.mbox/%3C
> > > CA
> > > M0VdefApmc5wBZQaJmQtbcnZ_OOgGv84qCuPoJS-KU4B=EwtA@mail.gmail.com%3E
> > >
> > >  So it seems like:
> > >
> > > * It only happens with EXACTLY_ONCE processing
> > > * It only happens on repartition topics (probably due to the changes
> > > in
> > > https://issues.apache.org/jira/browse/KAFKA-6150)
> > > * The errors are happening continuously (on each produced message)
> > >
> > > The email I referred to points to
> > > https://issues.apache.org/jira/browse/KAFKA-7190 for ongoing work on
> > > solving this issue. However, it seems that this 7190 solves the
> > > UNKNOWN_PRODUCER_ID with a different cause (low traffic). The hotfix
> > > of changing the retention settings on the topic does not seem to
> > > resolve the errors in our case.
> > >
> > > Do you think the changes listed under KIP-360
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve
> > > +h
> > > andling+of+unknown+producer
> > > could solve what we are seeing here?
> > >
> > > Best,
> > >
> > > Pieter
> > >
> > > -----Oorspronkelijk bericht-----
> > > Van: Pieter Hameete <pieter.hameete@blockbax.com>
> > > Verzonden: Thursday, 6 June 2019 13:39
> > > Aan: users@kafka.apache.org
> > > Onderwerp: RE: Repeating UNKNOWN_PRODUCER_ID errors for Kafka
> > > streams applications
> > >
> > > Hi Guozhang, Matthias,
> > >
> > > @Guozhang The brokers are on version 2.2.0-cp2 and the clients are
> > > version 2.2.1. We have upgraded yesterday from 2.1.1 to this version
> > > to see if it would make a difference, but unfortunately not. After
> > > restarting the brokers and streams applications the broker errors
> > > and client warnings continued as before.
> > >
> > > @Matthias I've increased the segment.bytes to be 10 times the
> > > default but this does not change anything.
> > >
> > > For my understanding: why are retention settings important here?
> > > Because the producer which is logging the UNKNOWN_PRODUCER_ID
> > > exception produced a message 5 seconds earlier (with the same
> > > exception though..). I remember from one of your comments (Matthias)
> > > that the brokers build up the set of known producers from the log.
> > > If a message was produced on a partition 5 seconds before by the
> > > same producer (confirmed via kafkacat), how can it be the broker
> > > throws an
> > UNKNOWN_PRODUCER_ID exception?
> > >
> > > Best,
> > >
> > > Pieter
> > >
> > > -----Oorspronkelijk bericht-----
> > > Van: Guozhang Wang <wangguoz@gmail.com>
> > > Verzonden: Thursday, 6 June 2019 01:39
> > > Aan: users@kafka.apache.org
> > > Onderwerp: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka
> > > streams applications
> > >
> > > Hello Pieter,
> > >
> > > It's normal to see offset increasing by 2: that is because each
> > > message on that partition is from a separate transaction which
> > > Streams EOS leverage on, and whenever a a txn commits a commit
> > > marker, written as a special message would be appended but not exposed
> to consumers.
> > > And hence that txn marker will take one offset as well.
> > >
> > > Your overridden configs look good to me, and it's indeed weird why
> > > it did not mitigate your observed issue. Which Kafka version did you
> use?
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Wed, Jun 5, 2019 at 4:43 AM Pieter Hameete
> > > <pieter.hameete@blockbax.com
> > > >
> > > wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Some additional finding: it seems to only happen on Kakfa Streams
> > > > repartition topics. We haven't seen this happening for any other
> > > > topics so far.
> > > >
> > > > Best,
> > > >
> > > > Pieter
> > > >
> > > > -----Oorspronkelijk bericht-----
> > > > Van: Pieter Hameete <pieter.hameete@blockbax.com>
> > > > Verzonden: Wednesday, 5 June 2019 11:23
> > > > Aan: users@kafka.apache.org
> > > > Onderwerp: RE: Repeating UNKNOWN_PRODUCER_ID errors for Kafka
> > > > streams applications
> > > >
> > > > Hi Guozhang,
> > > >
> > > > Thanks for your reply! I noticed my original mail went out twice
> > > > by accident, sorry for that.
> > > >
> > > > We currently have a small variety of keys so not all partitions
> > > > are 'actively used' indeed. The strange thing is though is that
> > > > the errors occur for the partitions that actively receive records
> > > > every few seconds. I have checked this using kafkacat to consume
> > > > the specific
> > > partitions.
> > > > Something I noticed was that for each received record the
> > > > partition offset was 2 higher than the previous record, instead of
> > > > the expected 1. Could that be due to the producers retrying (see
> > > > warning logs in my original mail)?
> > > >
> > > > I've done the override for the configs in the repartition topics
> > > > as follows, on one of the brokers:
> > > >
> > > > The values are taken from your KIP-443
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-443%3A+Retur
> > > > n+
> > > > to
> > > > +default+segment.ms+and+segment.index.bytes+in+Streams+repartition
> > > > +default++t
> > > > +default+op
> > > > ics
> > > >
> > > > kafka-topics --zookeeper localhost:2181 --alter --topic
> > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartitio
> > > > n --config segment.index.bytes=10485760 kafka-topics --zookeeper
> > > > localhost:2181 --alter --topic
> > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartitio
> > > > n --config segment.bytes= 52428800 kafka-topics --zookeeper
> > > > localhost:2181 --alter --topic
> > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartitio
> > > > n --config segment.ms=604800000 kafka-topics --zookeeper
> > > > localhost:2181 --alter --topic
> > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartitio
> > > > n
> > > > --config retention.ms=-1
> > > >
> > > > Verifying afterwards:
> > > >
> > > > kafka-topics --zookeeper localhost:2181 --describe --topic
> > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartitio
> > > > n
> > > >
> > > > Topic:event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repa
> > > > rt
> > > > it
> > > > ion
> > > >
> > > >  PartitionCount:32       ReplicationFactor:3
> > > >  Configs:segment.bytes=52428800,retention.ms
> > > > =-1,segment.index.bytes=10485760,segment.ms
> > > > =604800000,cleanup.policy=delete
> > > >
> > > > Is there anything that seems off to you? Or something else I can
> > > > investigate further? We'd really like to nail this issue down.
> > > > Especially because the cause seems different than the 'low traffic'
> > > > cause in JIRA issue KAFKA-7190 as the partitions for which errors
> > > > are thrown are receiving data.
> > > >
> > > > Best,
> > > >
> > > > Pieter
> > > >
> > > > -----Oorspronkelijk bericht-----
> > > > Van: Guozhang Wang <wangguoz@gmail.com>
> > > > Verzonden: Wednesday, 5 June 2019 02:23
> > > > Aan: users@kafka.apache.org
> > > > Onderwerp: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka
> > > > streams applications
> > > >
> > > > Hello Pieter,
> > > >
> > > > If you only have one record every few seconds that may be too
> > > > small given you have at least 25 partitions (as I saw you have a
> > > > xxx--repartition-24 partition), which means that for a single
> > > > partition, it may not see any records for a long time, and in this
> > > > case you may need to override it to very large values. On the
> > > > other hand, if you can reduce your num.partitions that may also
> > > > help
> > > increasing the traffic per partition.
> > > >
> > > > Also could you show me how did you override the configs in the
> > > > repartition topics?
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Tue, Jun 4, 2019 at 2:10 AM Pieter Hameete
> > > > <pieter.hameete@blockbax.com
> > > > >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Our Kafka streams applications are showing the following warning
> > > > > every few seconds (on each of our 3 brokers, and on each of the
> > > > > 2 instances of the streams application):
> > > > >
> > > > >
> > > > > [Producer
> > > > > clientId=event-rule-engine-dd71ae9b-523c-425d-a7c0-c62993315b30-
> > > > > St
> > > > > re am Thread-1-1_24-producer,
> > > > > transactionalId=event-rule-engine-1_24]
> > > > > Resetting sequence number of batch with current sequence 1 for
> > > > > partition
> > > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartit
> > > > > io
> > > > > n-
> > > > > 24
> > > > > to 0
> > > > >
> > > > >
> > > > >
> > > > > Followed by:
> > > > >
> > > > >
> > > > >
> > > > > [Producer
> > > > > clientId=event-rule-engine-dd71ae9b-523c-425d-a7c0-c62993315b30-
> > > > > St
> > > > > re am Thread-1-1_24-producer,
> > > > > transactionalId=event-rule-engine-1_24]
> > > > > Got error produce response with correlation id 5902 on
> > > > > topic-partition
> > > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartit
> > > > > io
> > > > > n-
> > > > > 24 , retrying (2147483646 attempts left). Error:
> > > > > UNKNOWN_PRODUCER_ID
> > > > >
> > > > > The brokers are showing errors that look related:
> > > > >
> > > > >
> > > > > Error processing append operation on partition
> > > > > event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartit
> > > > > io
> > > > > n-
> > > > > 24
> > > > > (kafka.server.ReplicaManager)
> > > > >
> > > > > org.apache.kafka.common.errors.UnknownProducerIdException: Found
> > > > > no record of producerId=72 on the broker. It is possible that
> > > > > the last message with the producerId=72 has been removed due to
> > > > > hitting the
> > > > retention limit.
> > > > >
> > > > >
> > > > >
> > > > > We would expect the UNKNOWN_PRODUCER_ID error to occur once.
> > > > > After a retry the record would be published on the partition and
> > > > > the PRODUCER_ID would be known. However, this error keeps
> > > > > occurring every few seconds. This is roughly at the same rate at
> > > > > which records are produced on the input topics partitions, so it
> > > > > seems like it occurs for
> > > > (nearly) every input record.
> > > > >
> > > > >
> > > > >
> > > > > The following JIRA issue:
> > > > > https://issues.apache.org/jira/browse/KAFKA-7190
> > > > > looks related. Except the Jira issue mentions ‘little traffic’,
> > > > > and I am not sure if a message per every few seconds is regarded
> > > > > as little
> > > > traffic.
> > > > > Matthias mentions in the issue that a workaround seems to be to
> > > > > increase topic configs `segment.bytes`, `segment.index.bytes`,
> > > > > and `segment.ms` for the corresponding repartition topics. We’ve
> > > > > tried manually overriding these configs for a relevant topic to
> > > > > the config values in the linked pull request
> > > > > (https://github.com/apache/kafka/pull/6511) but this did not
> > > > > result in
> > > > the errors disappearing.
> > > > >
> > > > >
> > > > >
> > > > > Could anyone help us to figure out what is happening here, and
> > > > > why the proposed fix for the above JIRA issue is not working in
> > > > > this
> > case?
> > > > >
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > >
> > > > >
> > > > > Pieter
> > > > >
> > > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
> >
> > --
> > -- Guozhang
> >
>
>
> --
> -- Guozhang
>


-- 
Santilli Jonathan

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message