I did not see any ZooKeeper session expirations in the broker logs (the exact greps I ran are at the bottom of this mail). I was able to perform a live upgrade on my local Mac OS X box, where I had three 0.8.0 brokers running: I took them down one at a time, upgraded them to 0.8.1, and did not encounter this issue. However, in my stage environment I have 8 brokers running across 4 nodes, with 300 topics, a replication factor of 4, and various partitioning settings for each topic, so it is much harder to pinpoint the cause. I keep running into this problem every single time I try to upgrade. Maybe this is configuration related? My stage env has much more complicated broker config files.
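For reference, this is roughly the sequence I follow on each broker when I upgrade it. This is only a sketch: the install paths and the ZooKeeper connect string below are placeholders, not my actual layout.

    # stop the running 0.8.0 broker (we run with the controlled shutdown feature enabled)
    /opt/kafka-0.8.0/bin/kafka-server-stop.sh

    # start the broker from the 0.8.1 build against the same server.properties and data dirs
    /opt/kafka-0.8.1/bin/kafka-server-start.sh /opt/kafka-0.8.1/config/server.properties &

    # once the broker shows up in the ISR again, move leadership back
    /opt/kafka-0.8.1/bin/kafka-preferred-replica-election.sh --zookeeper zkhost:2181

    # sanity-check leaders and ISR before touching the next broker
    /opt/kafka-0.8.1/bin/kafka-topics.sh --describe --zookeeper zkhost:2181

That is the same sequence that worked fine on my local three-broker setup.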
Thanks,
Martin

On Thu, Apr 10, 2014 at 8:19 PM, Jun Rao wrote:

> One should be able to upgrade from 0.8 to 0.8.1 one broker at a time online. There are some corner cases that we are trying to patch in 0.8.1.1, which will be released soon.
>
> As for your issue, not sure what happened. Do you see any ZK session expirations in the broker log?
>
> Thanks,
>
> Jun
>
> On Thu, Apr 10, 2014 at 7:34 PM, Marcin Michalski wrote:
>
> > I see that the state-change logs have warning messages of this kind (broker 7 is running the 0.8.1 code, and this is a log snippet from that broker):
> >
> > ...s associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
> > [2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [pets_nec_buygold,0] since its associated leader epoch 12 is old. Current leader epoch is 12 (state.change.logger)
> > [2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [cafe_notification,0] since its associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
> > [2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [set_primary_photo,0] since the new leader 1008 is the same as the old leader (state.change.logger)
> > [2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [external_url,0] since the new leader 1001 is the same as the old leader (state.change.logger)
> >
> > And these are snippets from the broker log of a 0.8.0 node that I shut down before I tried to upgrade it (this is when most topics became unusable):
> >
> > [2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [variant_assign,0] since its associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
> > [2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [meetme_new_contact_count,0] since its associated leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
> > [2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [m3_auth,0] since the new leader 7 is the same as the old leader (state.change.logger)
> > [2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [newsfeed_likes,0] since the new leader 1001 is the same as the old leader (state.change.logger)
> >
> > In terms of upgrading from 0.8.0 to 0.8.1, is there a recommended approach that one should follow? Is it possible to migrate from one version to the next on a live cluster, one server at a time?
> >
> > Thanks,
> > Martin
> >
> > On Wed, Apr 9, 2014 at 8:38 PM, Jun Rao wrote:
> >
> > > Was there any error in the controller and the state-change logs?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Apr 9, 2014 at 11:18 AM, Marcin Michalski <mmichalski@tagged.com> wrote:
> > >
> > > > Hi, has anyone upgraded their Kafka from 0.8.0 to 0.8.1 successfully, one broker at a time, on a live cluster?
> > > >
> > > > I am seeing strange behavior where many of my Kafka topics become unusable (by both consumers and producers). When that happens, I see lots of errors in the server logs that look like this:
> > > >
> > > > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-15-1007 on partition [risk,0] failed due to Topic risk either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
> > > > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-7-1007 on partition [message,0] failed due to Topic message either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
> > > >
> > > > When I try to consume a message from a topic that complained about the topic not existing (above warning), I get the exception below:
> > > >
> > > > ....topic message --from-beginning
> > > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > > > [2014-04-09 10:40:30,571] WARN [console-consumer-90716_dkafkadatahub07.tag-dev.com-1397065229615-7211ba72-leader-finder-thread], Failed to add leader for partitions [message,0]; will retry (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
> > > > kafka.common.UnknownTopicOrPartitionException
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > >     at java.lang.Class.newInstance0(Class.java:355)
> > > >     at java.lang.Class.newInstance(Class.java:308)
> > > >     at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:79)
> > > >     at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:167)
> > > >     at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
> > > >     at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
> > > >     at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
> > > >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > > >     at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
> > > >     at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
> > > >     at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
> > > >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > > >     at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
> > > >     at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
> > > >     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > > > ----------
> > > >
> > > > *More details about my issues:*
> > > > My current configuration in the environment where I am testing the upgrade is 4 physical servers running 2 brokers each, with the controlled shutdown feature enabled. When I shut down the 2 brokers on one of the existing Kafka 0.8.0 machines, upgrade that machine to 0.8.1, and restart it, all is fine for a bit. Once the new brokers come up, I run kafka-preferred-replica-election.sh to make sure the restarted brokers become leaders of existing topics again. The replication factor on the topics is set to 4. I tested both producing and consuming messages against brokers that were leaders on Kafka 0.8.0 and on 0.8.1, and no issues were encountered.
> > > > Later, I tried to perform the controlled shutdown of the 2 additional brokers on the Kafka server that still has 0.8.0 installed. After those brokers shut down and new leaders were assigned, all of my server logs started filling up with the above exceptions and most of my topics became unusable. I pulled and built the 0.8.1 Kafka code from git last Thursday, so I should be pretty much up to date. So I am not sure if I am doing something wrong or if migrating from 0.8.0 to 0.8.1 on a live cluster one server at a time is not supported. Is there a recommended migration approach that one should take when migrating a live 0.8.0 cluster to 0.8.1?
> > > >
> > > > As for who is the leader of one of the topics that became unusable, it is the broker that was successfully upgraded to 0.8.1:
> > > >
> > > > Topic:message  PartitionCount:1  ReplicationFactor:4  Configs:
> > > >     Topic: message  Partition: 0  *Leader: 1007*  Replicas: 1007,8,9,1001  Isr: 1001,1007,8
> > > >
> > > > Brokers 9 and 1009 were shut down on one physical server that had Kafka 0.8.0 installed when these problems started occurring (I was planning to upgrade them to 0.8.1). The only way I can recover from this state is to shut down all brokers, delete all of the Kafka topic logs plus the Kafka directory in ZooKeeper, and start with a new cluster.
> > > >
> > > > Your help in this matter is greatly appreciated.
> > > >
> > > > Thanks,
> > > > Martin
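P.S. Regarding the ZK session expirations: these are the greps I ran against the broker and controller logs. The log file locations are placeholders for wherever log4j writes them in your setup, and the patterns are simply the strings I looked for, so this may not be an exhaustive check:

    grep -i "expired" /var/log/kafka/server.log /var/log/kafka/controller.log /var/log/kafka/state-change.log
    grep -i "zookeeper state changed" /var/log/kafka/server.log

Neither grep turned up anything.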