kafka-users mailing list archives

From "Dong, John" <zunhai.d...@ebay.com>
Subject Replication stop working
Date Tue, 27 Jan 2015 18:59:22 GMT
Hi,

I am new to this forum and am not sure whether this is the correct mailing list for this kind
of question. If not, please let me know and I will stop.

I am looking for help resolving a replication issue. Replication stopped working a while back.

Kafka environment: Kafka 0.8.1.1, CentOS 6.5, 7-node cluster, default replication factor 2,
10 partitions per topic.

Initially each partition resided on two different nodes. It had been working this way for
several months. Starting two weeks ago, two things happened.

  1.  One node's disk usage reached 100%, which crashed the Kafka process. We had to delete some
*.log and *.index files and restart the Kafka process.
  2.  In another case, a different node's disk usage reached 90%. Someone deleted some *.log
and *.index files without shutting down the Kafka process. This caused problems and Kafka could
not be restarted. I had to delete all *.log and *.index files on this node to bring Kafka back online.

Now replication is broken. Most partitions have a leader but only one replica in the ISR, even
though each is assigned to two broker ids. Whenever I shut down the Kafka process on a node,
any leader running on that node moves to the other node in its replica list.
After I restart Kafka on that node, it never becomes a follower again and its data directory
is never updated.
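
To make the symptom concrete, here is a minimal Python sketch that flags partitions whose ISR is smaller than their replica list. It assumes the tab-separated line format printed by kafka-topics.sh --describe; the sample text, the topic name, the broker ids, and the function name under_replicated are all made up for illustration and are not output from this cluster.

```python
import re

# Hypothetical sample in the style of `kafka-topics.sh --describe` output.
# The topic name and broker ids are invented for illustration.
SAMPLE = """\
Topic:events\tPartitionCount:3\tReplicationFactor:2\tConfigs:
\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1
\tTopic: events\tPartition: 1\tLeader: 2\tReplicas: 2,3\tIsr: 2,3
\tTopic: events\tPartition: 2\tLeader: 3\tReplicas: 3,4\tIsr: 3
"""

# Matches one per-partition line; the topic summary line is skipped because
# it contains "PartitionCount:" rather than "Partition:".
LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for each partition
    whose ISR does not contain every assigned replica."""
    result = []
    for line in describe_output.splitlines():
        m = LINE.search(line)
        if m is None:
            continue
        replicas = [int(x) for x in m.group("replicas").split(",")]
        isr = [int(x) for x in m.group("isr").split(",") if x]
        missing = sorted(set(replicas) - set(isr))
        if missing:
            result.append((m.group("topic"), int(m.group("partition")), missing))
    return result


if __name__ == "__main__":
    for topic, partition, missing in under_replicated(SAMPLE):
        print(topic, partition, "missing from ISR:", missing)
```

Feeding real describe output into under_replicated would list every partition where a restarted broker has not rejoined the ISR.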

I tried the following:


  1.  I turned on TRACE/DEBUG logging for Kafka and ZooKeeper. I did not find anything
that helped.
  2.  I tried to manipulate the replication state in ZooKeeper using zkCli.sh, for example
by adding a follower to the ISR list. That did not start a fetcher to make the broker
a follower.
  3.  I created a new topic, and replication worked initially. But as soon as I shut down
Kafka on one of its two nodes, the partition lost one replica from the ISR and it never came back.
This confirms that the problem is reproducible.
  4.  I ran the Kafka preferred replica election tool to force re-election of leaders. That
did not do anything. It is as if nothing happened to the cluster.
  5.  I added num.replica.fetchers=10 to server.properties and restarted Kafka. That did not
do anything either.
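
As a rough illustration of the ZooKeeper inspection in item 2, the sketch below compares the assigned replicas with the ISR recorded per partition. It assumes the 0.8.x layout, where /brokers/topics/<topic> holds the replica assignment and /brokers/topics/<topic>/partitions/<n>/state holds the leader and ISR as JSON. The JSON strings are made-up examples rather than data from this cluster, and replicas_out_of_sync is a hypothetical helper name.

```python
import json

# Made-up znode contents for illustration (not taken from this cluster).
# Assumed to mirror /brokers/topics/<topic> in Kafka 0.8.x.
ASSIGNMENT_JSON = '{"version":1,"partitions":{"0":[1,2],"1":[2,3]}}'

# Assumed to mirror /brokers/topics/<topic>/partitions/<n>/state, keyed by n.
STATE_JSON = {
    0: '{"controller_epoch":5,"leader":1,"version":1,"leader_epoch":7,"isr":[1]}',
    1: '{"controller_epoch":5,"leader":2,"version":1,"leader_epoch":3,"isr":[2,3]}',
}


def replicas_out_of_sync(assignment_json, state_json_by_partition):
    """Map partition id -> leader and the assigned brokers absent from the ISR."""
    assignment = json.loads(assignment_json)["partitions"]
    report = {}
    for partition, replicas in assignment.items():
        state = json.loads(state_json_by_partition[int(partition)])
        lagging = [b for b in replicas if b not in state["isr"]]
        if lagging:
            report[int(partition)] = {"leader": state["leader"], "missing": lagging}
    return report


if __name__ == "__main__":
    print(replicas_out_of_sync(ASSIGNMENT_JSON, STATE_JSON))
```

This only reads state. As noted in item 2, editing the ISR by hand did not start a fetcher, so the comparison is useful for diagnosis rather than repair.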

Does anyone have experience with this? Any advice on where to look and what the next steps
are for troubleshooting? There are only two things left that I may have to do.


  1.  Shut down all Kafka and ZooKeeper processes and restart them. I really do not want to go
this route unless I have to. I would like to identify the root cause rather than randomly restart
the whole cluster.
  2.  Move all topics to another Kafka cluster and rebuild this one. This would be very time
consuming and require a lot of changes in the application.

Thanks.

John Dong
