kafka-users mailing list archives

From Chen Song <chen.song...@gmail.com>
Subject Re: Topic partitions randomly failed on live system
Date Mon, 21 Sep 2015 21:11:41 GMT
+1 on this problem. We are using the same version (0.8.2.0) and are seeing a
similar issue.

The stacktrace we have seen is

[2015-09-18 08:57:47,147] ERROR [Replica Manager on Broker 58]: Error when
processing fetch request for partition [topic1,22] offset 19068459 from
follower with correlation id 234437982. Possible cause: Request for offset
19068459 but we only have log segments
kafka.common.NotAssignedReplicaException: Leader 58 failed to record
follower 31's position -1 since the replica is not recognized to be one of
the assigned replicas 58 for partition [topic2,1]
        at
kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
        at
kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
        at
kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
        at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
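
A quick way to see which partitions are hitting this is to scan the broker log
for NotAssignedReplicaException lines. This is just a minimal sketch I put
together; the regex assumes the exact message format shown above:

```python
import re

# Matches the NotAssignedReplicaException message format from the broker log:
# "Leader <id> failed to record follower <id>'s position <offset> ...
#  for partition [<topic>,<partition>]"
PATTERN = re.compile(
    r"NotAssignedReplicaException: Leader (\d+) failed to record "
    r"follower (\d+)'s position (-?\d+) .* for partition \[([^,\]]+),(\d+)\]"
)

def affected_partitions(lines):
    """Return (leader, follower, topic, partition) tuples found in log lines."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            leader, follower, _pos, topic, part = m.groups()
            hits.append((int(leader), int(follower), topic, int(part)))
    return hits

# Example: the log line from our broker 58 above.
log = [
    "kafka.common.NotAssignedReplicaException: Leader 58 failed to record "
    "follower 31's position -1 since the replica is not recognized to be one of "
    "the assigned replicas 58 for partition [topic2,1]",
]
print(affected_partitions(log))  # [(58, 31, 'topic2', 1)]
```

In our case this showed the exception naming a different partition ([topic2,1])
than the fetch request that triggered it ([topic1,22]).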

On Tue, Aug 25, 2015 at 4:26 AM, Simon Cooper <
simon.cooper@featurespace.co.uk> wrote:

> This has happened again on the same system. Is anyone able to offer any
> pointers towards a possible cause? We have no idea what is wrong or how
> to stop it happening again.
>
> We need to diagnose and fix this issue soon, as we can't have a live
> system that randomly fails due to unknown causes!
>
> Thanks,
> SimonC
>
> -----Original Message-----
> From: Simon Cooper [mailto:simon.cooper@featurespace.co.uk]
> Sent: 17 August 2015 12:31
> To: users@kafka.apache.org
> Subject: Topic partitions randomly failed on live system
>
> Hi,
>
> We've had an issue on a live system (3 brokers, ~10 topics, some
> replicated, some partitioned) where a partition wasn't properly reassigned,
> causing several other partitions to go down.
>
> First, this exception happened on broker 1 (we weren't doing anything
> particular on the system at the time):
>
> ERROR [AddPartitionsListener on 1]: Error while handling add partitions
> for data path /brokers/topics/topic1
> (kafka.controller.PartitionStateMachine$AddPartitionsListener)
> java.util.NoSuchElementException: key not found: [topic1,0]
>         at scala.collection.MapLike$class.default(MapLike.scala:228)
>         at scala.collection.AbstractMap.default(Map.scala:58)
>         at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>         at
> kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:112)
>         at
> kafka.controller.ControllerContext$$anonfun$replicasForPartition$1.apply(KafkaController.scala:111)
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
>         at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at
> scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
>         at scala.collection.SetLike$class.map(SetLike.scala:93)
>         at scala.collection.AbstractSet.map(Set.scala:47)
>         at
> kafka.controller.ControllerContext.replicasForPartition(KafkaController.scala:111)
>         at
> kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:485)
>         at
> kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply$mcV$sp(PartitionStateMachine.scala:530)
>         at
> kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
>         at
> kafka.controller.PartitionStateMachine$AddPartitionsListener$$anonfun$handleDataChange$1.apply(PartitionStateMachine.scala:519)
>         at kafka.utils.Utils$.inLock(Utils.scala:535)
>         at
> kafka.controller.PartitionStateMachine$AddPartitionsListener.handleDataChange(PartitionStateMachine.scala:518)
>         at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>
> At this point, broker 2 started continually spamming these messages
> (mentioning other topics, not just topic1):
>
> ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic1,2] to
> broker 1:class kafka.common.UnknownException
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic2,0] to
> broker 1:class kafka.common.UnknownException
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-1], Error for partition [othertopic3,0] to
> broker 1:class kafka.common.UnknownException
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-1], Error for partition [topic1,0] to broker
> 1:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)
>
> And broker 1 had these messages, but only for topic1:
>
> ERROR [KafkaApi-1] error when handling request Name: FetchRequest;
> Version: 0; CorrelationId: 41182755; ClientId: ReplicaFetcherThread-0-1;
> ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic1,0]
> -> PartitionFetchInfo(0,1048576) (kafka.server.KafkaApis)
> kafka.common.NotAssignedReplicaException: Leader 1 failed to record
> follower 2's position 0 since the replica is not recognized to be one of
> the assigned replicas 1 for partition [topic1,0]
>         at
> kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
>         at
> kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
>         at
> kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
>         at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>         at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>         at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
>         at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>         at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
>         at
> kafka.server.KafkaApis.recordFollowerLogEndOffsets(KafkaApis.scala:386)
>         at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:351)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
>         at
> kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
>         at java.lang.Thread.run(Thread.java:745)
>
> At this time, any topic that had broker 1 as a leader was not working. ZK
> thought that everything was ok and in sync.
>
> Restarting broker 1 fixed the broken topics for a bit, until broker 1 was
> reassigned as leader of some topics, at which point it broke again.
> Restarting broker 2 fixed it (!!!!).
>
> We're using kafka_2.10-0.8.2.0 (Kafka 0.8.2.0 on Scala 2.10). Could anyone
> explain what happened, and (most importantly) how we can stop it happening
> again?
>
> Many thanks,
> SimonC
>
>


-- 
Chen Song
