kafka-users mailing list archives

From svante karlsson <s...@csi.se>
Subject Re: kafka.server.ReplicaManager error
Date Thu, 05 Feb 2015 21:28:35 GMT
In our case unclean leader election was enabled.

As the cluster should have been empty, I can't really say that we did not
lose any data, but as I wrote earlier, I could not get the log messages to
stop until I took down all brokers at the same time.

2015-02-05 22:16 GMT+01:00 Kyle Banker <kylebanker@gmail.com>:

> Thanks for sharing, svante. We're also running 0.8.2.
>
> Our cluster appears to be completely unusable at this point. We tried
> restarting the "down" broker with a clean log directory, and it's doing
> nothing. It doesn't seem to be able to get topic data, which this Zookeeper
> message appears to confirm:
>
> [ProcessThread(sid:5 cport:-1)::PrepRequestProcessor@645] - Got user-level
> KeeperException when processing sessionid:0x54b0e251a5cd0ec type:setData
> cxid:0x2b7ab zxid:0x100b9ad88 txntype:-1 reqpath:n/a Error
> Path:/brokers/topics/mytopic/partitions/143/state Error:KeeperErrorCode =
> BadVersion for /brokers/topics/mytopic/partitions/143/state
>
> It's probably worthwhile to note that we've disabled unclean leader
> election.
>
>
>
> On Thu, Feb 5, 2015 at 2:01 PM, svante karlsson <saka@csi.se> wrote:
>
> > I believe I've had the same problem on the 0.8.2 rc2. We had an idle test
> > cluster with unknown health status and I applied rc3 without checking if
> > everything was ok before. Since that cluster had been doing nothing for a
> > couple of days and the retention time was 48 hours it's reasonable to
> > assume that no actual data was left on the cluster. The same type of log
> > messages was emitted in large volumes and never stopped. I then rebooted
> > each zookeeper in series. No change. Then bumped each broker, no change.
> > Finally I took down all brokers at the same time.
> >
> > The logging stopped but then one broker did not have any partitions in
> > sync, including the internal consumer offset topic that was living
> > (with replicas=1) on that broker. I then bumped this broker once more and
> > then my whole cluster came into sync.
> >
> > I suspect that something related to 0 size topics caused this since the
> > cluster worked fine the week before during testing and also after during
> > more testing with rc3.
> >
> > 2015-02-05 19:22 GMT+01:00 Kyle Banker <kylebanker@gmail.com>:
> >
> > > Digging in a bit more, it appears that the "down" broker had likely
> > > partially failed. Thus, it was still attempting to fetch offsets that no
> > > longer exist. Does this make sense as an explanation of the
> > > above-mentioned behavior?
> > >
> > > On Thu, Feb 5, 2015 at 10:58 AM, Kyle Banker <kylebanker@gmail.com> wrote:
> > >
> > > > Dug into this a bit more, and it turns out that we lost one of our 9
> > > > brokers at the exact moment when this started happening. At the time that
> > > > we lost the broker, we had no under-replicated partitions. Since the broker
> > > > disappeared, we've had a fairly constant number of under-replicated
> > > > partitions. This makes some sense, of course.
> > > >
> > > > Still, the log message doesn't.
> > > >
> > > > On Thu, Feb 5, 2015 at 10:39 AM, Kyle Banker <kylebanker@gmail.com> wrote:
> > > >
> > > >> I have a 9-node Kafka cluster, and all of the brokers just started
> > > >> spouting the following error:
> > > >>
> > > > >> ERROR [Replica Manager on Broker 1]: Error when processing fetch request
> > > > >> for partition [mytopic,57] offset 0 from follower with correlation id
> > > > >> 58166. Possible cause: Request for offset 0 but we only have log segments
> > > > >> in the range 39 to 39. (kafka.server.ReplicaManager)
> > > >>
> > > > >> The "mytopic" topic has a replication factor of 3, and metrics are
> > > > >> showing a large number of under-replicated partitions.
> > > > >>
> > > > >> My assumption is that a log aged out but that the replicas weren't aware
> > > > >> of it.
> > > >>
> > > >> In any case, this problem isn't fixing itself, and the volume of log
> > > >> messages of this type is enormous.
> > > >>
> > > >> What might have caused this? How does one resolve it?
> > > >>
> > > >
> > > >
> > >
> >
>
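
The ReplicaManager error quoted in this thread describes a fetch for an offset that lies outside the range the leader still holds. The sketch below is a toy model of that check, not Kafka's implementation: `PartitionLog`, `check_fetch` and `follower_fetch` are hypothetical names, and the recovery step (the follower restarting from the leader's earliest available offset) is an assumption made for illustration only.

```python
from dataclasses import dataclass


class OffsetOutOfRange(Exception):
    """Raised when a fetch asks for an offset the log does not hold."""


@dataclass
class PartitionLog:
    # Hypothetical stand-in for a leader's log. It holds offsets in the
    # inclusive range [log_start_offset, log_end_offset]; anything older
    # is assumed to have been deleted by retention.
    log_start_offset: int
    log_end_offset: int

    def check_fetch(self, fetch_offset: int) -> None:
        # Reject fetches outside the range, mirroring the wording of the
        # error message quoted in the thread.
        if not (self.log_start_offset <= fetch_offset <= self.log_end_offset):
            raise OffsetOutOfRange(
                f"Request for offset {fetch_offset} but we only have log "
                f"segments in the range {self.log_start_offset} "
                f"to {self.log_end_offset}."
            )


def follower_fetch(leader: PartitionLog, follower_offset: int) -> int:
    """Return the offset the follower should fetch from next.

    Assumption for this sketch: on an out-of-range error the follower
    restarts from the leader's earliest available offset.
    """
    try:
        leader.check_fetch(follower_offset)
        return follower_offset
    except OffsetOutOfRange:
        return leader.log_start_offset


# A follower with an empty log asks for offset 0 while the leader
# only holds offsets 39 to 39, as in the quoted error.
leader = PartitionLog(log_start_offset=39, log_end_offset=39)
next_offset = follower_fetch(leader, 0)
```

In this model a single out-of-range fetch is enough for the follower to catch up. The endless stream of identical errors reported in the thread suggests the followers kept retrying offset 0 instead, which is consistent with the observation that only a full restart of all brokers stopped the messages.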
