kafka-users mailing list archives

From Marcin Michalski <mmichal...@tagged.com>
Subject Re: Upgrading from 0.8.0 to 0.8.1 one broker at a time issues
Date Fri, 11 Apr 2014 17:47:40 GMT
I did not see any ZooKeeper session expirations. I was able to perform a
live upgrade on my local Mac OS X machine, where I had three 0.8.0
brokers running, took them down one at a time, and upgraded them to
0.8.1 without encountering this issue. In my stage environment, however,
I have 8 brokers running across 4 nodes with 300 topics, a replication
factor of 4, and various partitioning settings per topic, so it is much
harder to pinpoint the cause. I keep running into this problem every
single time I try to upgrade. Maybe this is configuration related? My
stage environment has much more complicated broker config files.
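
(That is, nothing like the following turned up in any broker log; the
log paths and exact message text here are just illustrative for my
setup:

  grep -i "state changed (Expired)" /var/log/kafka/server.log
  grep -i "expired" /var/log/kafka/controller.log
)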

Thanks,
Martin


On Thu, Apr 10, 2014 at 8:19 PM, Jun Rao <junrao@gmail.com> wrote:

> One should be able to upgrade from 0.8 to 0.8.1 one broker at a time
> online. There are some corner cases that we are trying to patch in 0.8.1.1,
> which will be released soon.
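>
> (Roughly, one broker at a time: stop the broker cleanly, install the
> 0.8.1 build against the same server.properties and data (log.dirs)
> directories, restart it, and wait for it to rejoin the ISR of its
> partitions before moving to the next broker. A sketch, with paths as
> placeholders:
>
>   bin/kafka-server-stop.sh
>   # swap in the 0.8.1 build, reusing config and data directories
>   bin/kafka-server-start.sh config/server.properties
> )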
>
> As for your issue, not sure what happened. Do you see any ZK session
> expirations in the broker log?
>
> Thanks,
>
> Jun
>
>
> On Thu, Apr 10, 2014 at 7:34 PM, Marcin Michalski <mmichalski@tagged.com>
> wrote:
>
> > I see that the state-change logs have warning messages of this kind
> > (broker 7 is running the 0.8.1 code; this is a log snippet from that
> > broker):
> >
> > ... since its associated leader epoch 11 is old. Current leader epoch
> > is 11 (state.change.logger)
> > [2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request
> > from controller 1001 with correlation id 0 epoch 7 for partition
> > [pets_nec_buygold,0] since its associated leader epoch 12 is old.
> > Current leader epoch is 12 (state.change.logger)
> > [2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request
> > from controller 1001 with correlation id 0 epoch 7 for partition
> > [cafe_notification,0] since its associated leader epoch 11 is old.
> > Current leader epoch is 11 (state.change.logger)
> > [2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower
> > state change after marking its partition as follower with correlation
> > id 0 from controller 1001 epoch 6 for partition [set_primary_photo,0]
> > since the new leader 1008 is the same as the old leader
> > (state.change.logger)
> > [2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower
> > state change after marking its partition as follower with correlation
> > id 0 from controller 1001 epoch 6 for partition [external_url,0] since
> > the new leader 1001 is the same as the old leader
> > (state.change.logger)
> >
> > And these are snippets from the broker log of a 0.8.0 node that I shut
> > down before I tried to upgrade it (this is when most topics became
> > unusable):
> >
> > [2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request
> > from controller 1001 with correlation id 0 epoch 7 for partition
> > [variant_assign,0] since its associated leader epoch 11 is old.
> > Current leader epoch is 11 (state.change.logger)
> > [2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request
> > from controller 1001 with correlation id 0 epoch 7 for partition
> > [meetme_new_contact_count,0] since its associated leader epoch 8 is
> > old. Current leader epoch is 8 (state.change.logger)
> > [2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower
> > state change after marking its partition as follower with correlation
> > id 0 from controller 1001 epoch 6 for partition [m3_auth,0] since the
> > new leader 7 is the same as the old leader (state.change.logger)
> > [2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower
> > state change after marking its partition as follower with correlation
> > id 0 from controller 1001 epoch 6 for partition [newsfeed_likes,0]
> > since the new leader 1001 is the same as the old leader
> > (state.change.logger)
> >
> > In terms of upgrading from 0.8.0 to 0.8.1, is there a recommended
> > approach that one should follow? Is it possible to migrate from one
> > version to the next on a live cluster one server at a time?
> >
> > Thanks,
> > Martin
> >
> >
> > On Wed, Apr 9, 2014 at 8:38 PM, Jun Rao <junrao@gmail.com> wrote:
> >
> > > Was there any error in the controller and the state-change logs?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Apr 9, 2014 at 11:18 AM, Marcin Michalski <mmichalski@tagged.com>
> > > wrote:
> > >
> > > > Hi, has anyone upgraded their Kafka from 0.8.0 to 0.8.1
> > > > successfully, one broker at a time, on a live cluster?
> > > >
> > > > I am seeing strange behavior where many of my Kafka topics become
> > > > unusable (by both consumers and producers). When that happens, I
> > > > see lots of errors in the server logs that look like this:
> > > >
> > > > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with
> > > > correlation id 2455 from client ReplicaFetcherThread-15-1007 on
> > > > partition [risk,0] failed due to Topic risk either doesn't exist
> > > > or is in the process of being deleted (kafka.server.KafkaApis)
> > > > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with
> > > > correlation id 2455 from client ReplicaFetcherThread-7-1007 on
> > > > partition [message,0] failed due to Topic message either doesn't
> > > > exist or is in the process of being deleted (kafka.server.KafkaApis)
> > > >
> > > > When I try to consume a message from a topic that the warning
> > > > above complained does not exist, I get the exception below:
> > > >
> > > > ....topic message --from-beginning
> > > > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > > > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > > > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
> > > > further details.
> > > > [2014-04-09 10:40:30,571] WARN
> > > > [console-consumer-90716_dkafkadatahub07.tag-dev.com-1397065229615-7211ba72-leader-finder-thread],
> > > > Failed to add leader for partitions [message,0]; will retry
> > > > (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
> > > > kafka.common.UnknownTopicOrPartitionException
> > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > at java.lang.Class.newInstance0(Class.java:355)
> > > > at java.lang.Class.newInstance(Class.java:308)
> > > > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:79)
> > > > at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:167)
> > > > at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
> > > > at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
> > > > at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
> > > > at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > > > at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
> > > > at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
> > > > at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
> > > > at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > > > at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
> > > > at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
> > > > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > > > ----------
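> > > >
> > > > (For anyone trying to reproduce this, the topic's registration in
> > > > ZK can be checked directly while it is in this state; a quick
> > > > sketch via zkCli.sh, with zk1:2181 as a placeholder for the
> > > > ensemble:
> > > >
> > > >   zkCli.sh -server zk1:2181
> > > >   get /brokers/topics/message
> > > >   get /brokers/topics/message/partitions/0/state
> > > > )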
> > > >
> > > > *More details about my issues:*
> > > > My current configuration in the environment where I am testing the
> > > > upgrade is 4 physical servers running 2 brokers each, with the
> > > > controlled shutdown feature enabled. When I shut down the 2 brokers
> > > > on one of the existing Kafka 0.8.0 machines, upgrade that machine
> > > > to 0.8.1, and restart it, all is fine for a bit. Once the new
> > > > brokers came up, I ran kafka-preferred-replica-election.sh to make
> > > > sure the restarted brokers became leaders of existing topics (exact
> > > > command below). The replication factor on the topics is set to 4.
> > > > I tested both producing and consuming messages against brokers that
> > > > were leaders with Kafka 0.8.0 and 0.8.1, and no issues were
> > > > encountered.
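> > > >
> > > > (The election step was just the stock tool, along the lines of the
> > > > following; zk1:2181 stands in for my real ZK connect string:
> > > >
> > > >   bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181
> > > >
> > > > Run without a JSON file of partitions, it moves leadership for all
> > > > partitions back to the preferred replica.)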
> > > >
> > > > Later, I tried to perform the controlled shutdown of the 2
> > > > additional brokers on the Kafka server that still has the 0.8.0
> > > > version installed, and after those brokers shut down and new
> > > > leaders were assigned, all of my server logs got filled up with
> > > > the above exceptions and most of my topics became unusable. I
> > > > pulled and built the 0.8.1 Kafka code from git last Thursday, so I
> > > > should be pretty much up to date. So I am not sure if I am doing
> > > > something wrong or if migrating from 0.8.0 to 0.8.1 on a live
> > > > cluster one server at a time is not supported. Is there a
> > > > recommended migration approach that one should take when migrating
> > > > a live cluster from 0.8.0 to 0.8.1?
> > > >
> > > > The leader of one of the topics that became unusable is the broker
> > > > that was successfully upgraded to 0.8.1:
> > > >
> > > > Topic:message  PartitionCount:1  ReplicationFactor:4  Configs:
> > > >     Topic: message  Partition: 0  *Leader: 1007*  Replicas: 1007,8,9,1001  Isr: 1001,1007,8
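> > > >
> > > > (That output is from the 0.8.1 topics tool, roughly:
> > > >
> > > >   bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic message
> > > >
> > > > with zk1:2181 again standing in for my ensemble.)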
> > > >
> > > > Brokers 9 and 1009 were shut down on one physical server that had
> > > > Kafka 0.8.0 installed when these problems started occurring (I was
> > > > planning to upgrade them to 0.8.1). The only way I can recover from
> > > > this state is to shut down all brokers, delete all of the Kafka
> > > > topic logs plus the ZooKeeper kafka directory, and start with a new
> > > > cluster.
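> > > >
> > > > (Concretely, the full reset looks roughly like this; the data path
> > > > is a placeholder for my log.dirs, and it obviously destroys all
> > > > topic data:
> > > >
> > > >   # on every broker host, after stopping all brokers
> > > >   rm -rf /var/kafka-logs/*
> > > >   # then, from zkCli.sh against the ensemble
> > > >   rmr /brokers
> > > >   rmr /controller_epoch
> > > >   rmr /admin
> > > >   rmr /config
> > > >   rmr /consumers
> > > > )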
> > > >
> > > >
> > > > Your help in this matter is greatly appreciated.
> > > >
> > > > Thanks,
> > > > Martin
> > > >
> > >
> >
>
