kafka-users mailing list archives

From Drew Goya <d...@gradientx.com>
Subject Re: Migrating a cluster from 0.8.0 to 0.8.1
Date Tue, 24 Dec 2013 05:48:41 GMT
Thanks for the help with all this stuff guys!

I completed a rolling upgrade to trunk/b23cf19 and was able to issue a
re-election without any brokers dropping out of the ISR list.
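
For reference, a re-election like this is normally issued with the preferred replica election tool that ships with 0.8.x; a minimal invocation (the ZooKeeper connect string is a placeholder) looks roughly like:

    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181

Without a --path-to-json-file argument it should run the election for every partition.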


On Mon, Dec 23, 2013 at 8:43 PM, Jun Rao <junrao@gmail.com> wrote:

> This is probably related to KAFKA-1154. Could you upgrade to the latest
> trunk?
>
> Thanks,
>
> Jun
>
>
> On Mon, Dec 23, 2013 at 3:21 PM, Drew Goya <drew@gradientx.com> wrote:
>
> > Hey All, another thing to report for my 0.8.1 migration.  I am seeing these
> > errors occasionally right after I run a leader election.  This looks to be
> > related to KAFKA-860 as it is the same exception.  I see this issue was
> > closed a while ago though and I should be running a commit with the fix in.
> > I'm on trunk/87efda.
> >
> > I also see there is a more recent issue with replica threads dying out while
> > becoming followers (KAFKA-1178) but I'm not seeing that exception.  I'm going
> > to roll updates through the cluster and bring my brokers up to trunk/b23cf1
> > and see how that goes.
> >
> > [2013-12-23 22:54:38,389] ERROR [ReplicaFetcherThread-0-11], Error due to  (kafka.server.ReplicaFetcherThread)
> > kafka.common.KafkaException: error processing data for partition [Events2,113] offset 1077499310
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:139)
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:111)
> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply$mcV$sp(AbstractFetcherThread.scala:111)
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:111)
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1.apply(AbstractFetcherThread.scala:111)
> >     at kafka.utils.Utils$.inLock(Utils.scala:538)
> >     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:110)
> >     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
> >     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > Caused by: java.lang.RuntimeException: Offset mismatch: fetched offset = 1077499310, log end offset = 1077499313.
> >     at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:49)
> >     at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$1$$anonfun$apply$mcV$sp$2.apply(AbstractFetcherThread.scala:130)
> >     ... 9 more
> >
> >
> > On Mon, Dec 23, 2013 at 2:50 PM, Drew Goya <drew@gradientx.com> wrote:
> >
> > > We are running on an Amazon Linux AMI; this is our specific version:
> > >
> > > Linux version 2.6.32-220.23.1.el6.centos.plus.x86_64
> > > (mockbuild@c6b5.bsys.dev.centos.org) (gcc version 4.4.6 20110731
> > > (Red Hat 4.4.6-3) (GCC) ) #1 SMP Tue Jun 19 04:14:37 BST 2012
> > >
> > >
> > > On Mon, Dec 23, 2013 at 11:24 AM, Guozhang Wang <wangguoz@gmail.com> wrote:
> > >
> > >> Hi Drew,
> > >>
> > >> I tried the kafka-server-stop script and it worked for me. Wondering
> > >> which OS you are using?
> > >>
> > >> Guozhang
> > >>
> > >>
> > >> On Mon, Dec 23, 2013 at 10:57 AM, Drew Goya <drew@gradientx.com> wrote:
> > >>
> > >> > Occasionally I do have to hard kill brokers, the kafka-server-stop.sh
> > >> > script stopped working for me a few months ago.  I saw another thread
> > >> > in the mailing list mentioning the issue too.  I'll change the signal
> > >> > back to SIGTERM and run that way for a while, hopefully the problem
> > >> > goes away.
> > >> >
> > >> > This is the commit where it changed:
> > >> >
> > >> > https://github.com/apache/kafka/commit/51de7c55d2b3107b79953f401fc8c9530bd0eea0
> > >> >
> > >> >
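
For context, the commit above swapped the signal kafka-server-stop.sh sends away from SIGTERM (to SIGKILL, if I'm reading it right), so changing it back locally is a one-line edit along these lines (a sketch of the idea, not the exact script contents):

    # kafka-server-stop.sh, roughly: find the kafka.Kafka JVM and signal it
    ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep \
      | awk '{print $1}' | xargs kill -SIGTERM   # current trunk sends SIGKILL here
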
> > >> > On Mon, Dec 23, 2013 at 10:09 AM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
> > >> >
> > >> > > Are you hard killing the brokers? And is this issue reproducible?
> > >> > >
> > >> > >
> > >> > > On Sat, Dec 21, 2013 at 11:39 AM, Drew Goya <drew@gradientx.com> wrote:
> > >> > >
> > >> > > > Hey guys, another small issue to report for 0.8.1.  After a couple
> > >> > > > days 3 of my brokers had fallen off the ISR list for 2-3 of their
> > >> > > > partitions.
> > >> > > >
> > >> > > > I didn't see anything unusual in the log and I just restarted one.
> > >> > > > It came up fine but as it loaded its logs these messages showed up:
> > >> > > > [2013-12-21 19:25:19,968] WARN [ReplicaFetcherThread-0-2], Replica 1 for partition [Events2,58] reset its fetch offset to current leader 2's start offset 1042738519 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:19,969] WARN [ReplicaFetcherThread-0-14], Replica 1 for partition [Events2,28] reset its fetch offset to current leader 14's start offset 1043415514 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,012] WARN [ReplicaFetcherThread-0-2], Current offset 1011209589 for partition [Events2,58] out of range; reset offset to 1042738519 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,013] WARN [ReplicaFetcherThread-0-14], Current offset 1010086751 for partition [Events2,28] out of range; reset offset to 1043415514 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-14], Replica 1 for partition [Events2,71] reset its fetch offset to current leader 14's start offset 1026871415 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-2], Replica 1 for partition [Events2,44] reset its fetch offset to current leader 2's start offset 1052372907 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-14], Current offset 993879706 for partition [Events2,71] out of range; reset offset to 1026871415 (kafka.server.ReplicaFetcherThread)
> > >> > > > [2013-12-21 19:25:20,036] WARN [ReplicaFetcherThread-0-2], Current offset 1020715056 for partition [Events2,44] out of range; reset offset to 1052372907 (kafka.server.ReplicaFetcherThread)
> > >> > > >
> > >> > > > Judging by the network traffic and disk usage changes after the
> > >> > > > reboot (both jumped up) a couple of the partition replicas had
> > >> > > > fallen behind and are now catching up.
> > >> > > >
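
A quick way to see which replicas have dropped out of the ISR like this is the topic tool's under-replicated filter; on an 0.8.1/trunk build something along these lines should work (tool name and flags assumed from the 0.8.1 tooling, ZooKeeper address is a placeholder):

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions
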
> > >> > > >
> > >> > > > On Thu, Dec 19, 2013 at 4:37 PM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
> > >> > > >
> > >> > > > > Hi Drew,
> > >> > > > >
> > >> > > > > That problem will be fixed by
> > >> > > > > https://issues.apache.org/jira/browse/KAFKA-1074.  I think we are
> > >> > > > > close to checking that in to trunk.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Neha
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Dec 18, 2013 at 9:02 AM, Drew Goya <drew@gradientx.com> wrote:
> > >> > > > >
> > >> > > > > > Thanks Neha, I rolled upgrades and completed a rebalance!
> > >> > > > > >
> > >> > > > > > I ran into a few small issues I figured I would share.
> > >> > > > > >
> > >> > > > > > On a few brokers, there were some log directories left over from
> > >> > > > > > some failed rebalances which prevented the 0.8.1 brokers from
> > >> > > > > > starting once I completed the upgrade.  These directories
> > >> > > > > > contained an index file and a zero-size log file; once I cleaned
> > >> > > > > > those out the brokers were able to start up fine.  If anyone else
> > >> > > > > > runs into the same problem, and is running RHEL, this is the bash
> > >> > > > > > script I used to clean them out:
> > >> > > > > >
> > >> > > > > > du --max-depth=1 -h /data/kafka/logs | grep K | sed s/.*K.// | sudo rm -r
> > >> > > > > >
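
One caveat on the one-liner as quoted: rm does not read names from standard input, so piping into it directly will not delete anything; a variant that should actually remove the small leftover directories (same du/grep/sed filtering, path is an example) routes the list through xargs:

    du --max-depth=1 -h /data/kafka/logs | grep K | sed 's/.*K.//' | xargs sudo rm -r

It is worth running everything up to the xargs first and eyeballing the list before deleting anything.
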
> > >> > > > > >
> > >> > > > > > On Tue, Dec 17, 2013 at 10:42 AM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > There are no compatibility issues. You can roll upgrades
> > >> > > > > > > through the cluster one node at a time.
> > >> > > > > > >
> > >> > > > > > > Thanks
> > >> > > > > > > Neha
> > >> > > > > > >
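
In practice the node-at-a-time roll is just a stop/upgrade/start cycle per broker, waiting for it to rejoin the ISR before moving on; roughly (paths and the properties file are placeholders, and the stop script can be swapped for a plain SIGTERM as discussed earlier in the thread):

    # repeat on each broker, one at a time
    bin/kafka-server-stop.sh                     # or: kill -SIGTERM <kafka pid>
    # ...install the 0.8.1/trunk build, keeping server.properties and the data dirs...
    bin/kafka-server-start.sh config/server.properties &
    # wait for the broker to rejoin the ISR of all its partitions before
    # moving on to the next one
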
> > >> > > > > > >
> > >> > > > > > > On Tue, Dec 17, 2013 at 9:15 AM, Drew Goya <drew@gradientx.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > So I'm going to be going through the process of upgrading a
> > >> > > > > > > > cluster from 0.8.0 to the trunk (0.8.1).
> > >> > > > > > > >
> > >> > > > > > > > I'm going to be expanding this cluster several times and the
> > >> > > > > > > > problems with reassigning partitions in 0.8.0 mean I have to
> > >> > > > > > > > move to trunk (0.8.1) asap.
> > >> > > > > > > >
> > >> > > > > > > > Will it be safe to roll upgrades through the cluster one by one?
> > >> > > > > > > >
> > >> > > > > > > > Also, are there any client compatibility issues I need to worry
> > >> > > > > > > > about?  Am I going to need to pause/upgrade all my
> > >> > > > > > > > consumers/producers at once or can I roll upgrades through the
> > >> > > > > > > > cluster and then upgrade my clients one by one?
> > >> > > > > > > >
> > >> > > > > > > > Thanks in advance!
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> > >
> > >
> >
>
