kafka-users mailing list archives

From Anthony Sparks <anthony.spark...@gmail.com>
Subject Re: Unable to start Kafka cluster after crash (0.8.2.2)
Date Wed, 24 Feb 2016 21:55:01 GMT
Thank you so much, Alexis -- good to know that those are not actual
failures.

We were able to get the cluster back up, but with minor data loss.  We
picked the server with the most up-to-date offsets of the bunch (it wasn't
the most up to date on every topic, but on most).  We then configured all
three servers with unclean.leader.election.enable=true and started them one
by one.  Like I said, this caused us to lose some data: the out-of-sync
replica became the new leader, so when the old leader came back online it
truncated its log to match the new leader.
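
For anyone who hits the same situation, the change amounted to roughly the
following in each broker's server.properties (standard Kafka config file;
adjust for your deployment), after which we started the brokers one by one:

    # temporarily allow an out-of-sync replica to become leader
    # (this accepts the data loss described above)
    unclean.leader.election.enable=true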

We traced the cause of the failure back to two of the three VM servers
crashing, going completely offline and rebooting.  This left our third
server as the leader; however, I don't understand how it could accept
messages while being an out-of-sync replica.  Furthermore, we have
min.insync.replicas=2 set, which should prevent anything from writing to
the topics.  I would have expected the cluster to be "down", but not to
have any trouble starting back up.
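
For reference, our normal broker-side settings (before the recovery steps
above) look roughly like this; as I understand it, min.insync.replicas is
only enforced for produce requests that ask for acknowledgement from all
in-sync replicas:

    # server.properties on each broker
    min.insync.replicas=2
    unclean.leader.election.enable=false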

Does this sound like a bug?

Thank you,

Tony


On Wed, Feb 24, 2016 at 1:32 PM, Alexis Midon <alexis.midon@airbnb.com.invalid> wrote:

> Regarding the "Allocation Failure" messages: these are not errors, it's
> the standard behavior of a generational GC. I'll let you google the
> details; there are tons of resources. For example:
>
> https://plumbr.eu/blog/garbage-collection/understanding-garbage-collection-logs
>
> I believe you should stop broker 1 and wipe out the data for the topic
> on that broker. Once restarted, replication will restore the data.
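>
> Roughly (paths are just an example; use whatever matches your deployment
> and the broker's configured log.dirs):
>
>   # on broker 1 only
>   bin/kafka-server-stop.sh
>   rm -rf /var/kafka-logs/audit_data-*   # this broker's partition directories for the topic
>   bin/kafka-server-start.sh config/server.properties
>
> Once the broker is back up it should re-fetch those partitions from the
> current leader.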
>
> On Wed, Feb 24, 2016 at 8:22 AM Anthony Sparks <anthony.sparks31@gmail.com> wrote:
>
> > Hello,
> >
> > Our Kafka cluster (3 servers, each running both Zookeeper and Kafka)
> > crashed, and out of the 6 processes only one Zookeeper instance remained
> > alive.  The logs do not indicate much; the only errors shown were:
> >
> >   2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> >   27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> >   139578K->3265K(507264K), 0.0048552 secs]
> >   [Times: user=0.01 sys=0.00, real=0.01 secs]
> >
> > These errors appear in both the Zookeeper and the Kafka logs, and it
> > seems they have been happening every day (with no impact on Kafka,
> > except for maybe now?).
> >
> > The crash is concerning, but not as concerning as what we are
> > encountering right now.  I am unable to get the cluster back up.  Two of
> > the three nodes halt with this fatal error:
> >
> >   [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting
> >   because log truncation is not allowed for topic audit_data, Current
> >   leader 0's latest offset 52844816 is less than replica 1's latest
> >   offset 52844835 (kafka.server.ReplicaFetcherThread)
> >
> > The other node that manages to stay alive is unable to fulfill writes
> > because we have min.ack set to 2 on the producers (requiring at least two
> > nodes to be available).  We could change this, but that doesn't fix our
> > overall problem.
> >
> > While browsing the Kafka code, we found this little nugget in
> > ReplicaFetcherThread.scala:
> >
> >   // Prior to truncating the follower's log, ensure that doing so is not
> >   // disallowed by the configuration for unclean leader election.
> >   // This situation could only happen if the unclean election configuration
> >   // for a topic changes while a replica is down. Otherwise, we should never
> >   // encounter this situation since a non-ISR leader cannot be elected if
> >   // disallowed by the broker configuration.
> >   if (!LogConfig.fromProps(brokerConfig.toProps,
> >       AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
> >         topicAndPartition.topic)).uncleanLeaderElectionEnable) {
> >     // Log a fatal error and shutdown the broker to ensure that data loss
> >     // does not unexpectedly occur.
> >     fatal("Halting because log truncation is not allowed for topic %s,".format(topicAndPartition.topic) +
> >       " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
> >       .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId,
> >         replica.logEndOffset.messageOffset))
> >     Runtime.getRuntime.halt(1)
> >   }
> >
> > Each of our Kafka instances is set to unclean.leader.election.enable=false,
> > which hasn't changed at all since we deployed the cluster (verified by
> > file modification timestamps).  To me this indicates that the assertion
> > in the comment above is incorrect: a non-ISR leader was elected even
> > though the configuration disallows it.
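> >
> > In case it matters: that check reads the topic-level config, so a
> > per-topic override of unclean.leader.election.enable would also count.
> > Something like this (stock kafka-topics.sh; adjust the Zookeeper connect
> > string) shows whether audit_data carries such an override:
> >
> >   bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic audit_data
> >   # any per-topic override would show up in the "Configs:" column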
> >
> > Any ideas on how to work around this?
> >
> > Thank you,
> >
> > Tony Sparks
> >
>
