kafka-users mailing list archives

From Gleb Zhukov <gzhu...@iponweb.net>
Subject Kafka loses data after one broker reboot
Date Tue, 01 Sep 2015 13:26:25 GMT
Hi, All!
We use 3 Kafka brokers with replication factor 2. Today we were doing a
partition reassignment and one of our brokers was rebooted due to a
hardware problem. After the broker returned to service we found that our
consumer fails with errors like:

ERROR java.lang.AssertionError: assumption failed: 765994 exceeds 6339
ERROR java.lang.AssertionError: assumption failed: 1501252 exceeds 416522
ERROR java.lang.AssertionError: assumption failed: 950819 exceeds 805377

Some logs from broker:

[2015-09-01 13:00:16,976] ERROR [Replica Manager on Broker 61]: Error when
processing fetch request for partition [avro_match,27] offset 208064729
from consumer with correlation id 0. Possible cause: Request for offset
208064729 but we only have log segments in the range 209248794 to
250879159. (kafka.server.ReplicaManager)
[2015-09-01 13:01:17,943] ERROR [Replica Manager on Broker 45]: Error when
processing fetch request for partition [logs.conv_expired,20] offset 454
from consumer with correlation id 0. Possible cause: Request for offset 454
but we only have log segments in the range 1349769 to 1476231.
(kafka.server.ReplicaManager)

[2015-09-01 13:21:23,896] INFO Partition [logs.avro_event,29] on broker 61:
Expanding ISR for partition [logs.avro_event,29] from 61,77 to 61,77,45
(kafka.cluster.Partition)
[2015-09-01 13:21:23,899] INFO Partition [logs.imp_tstvssamza,6] on broker
61: Expanding ISR for partition [logs.imp_tstvssamza,6] from 61,77 to
61,77,45 (kafka.cluster.Partition)
[2015-09-01 13:21:23,902] INFO Partition [__consumer_offsets,30] on broker
61: Expanding ISR for partition [__consumer_offsets,30] from 61,77 to
61,77,45 (kafka.cluster.Partition)
[2015-09-01 13:21:23,905] INFO Partition [logs.test_imp,44] on broker 61:
Expanding ISR for partition [logs.test_imp,44] from 61 to 61,45
(kafka.cluster.Partition)

Looks like we lost part of our data.

Also, Kafka started re-replicating seemingly random partitions (the bad
broker was already up and running, and log recovery had completed):

root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc -l
Tue Sep  1 13:02:24 UTC 2015
431
root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc -l
Tue Sep  1 13:02:37 UTC 2015
386
root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc -l
Tue Sep  1 13:02:48 UTC 2015
501
root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc -l
Tue Sep  1 13:02:58 UTC 2015
288
root@kafka2d:~# date && /usr/lib/kafka/bin/kafka-topics.sh --zookeeper
zk-pool.gce-eu.kafka/kafka --under-replicated-partitions --describe  | wc -l
Tue Sep  1 13:03:08 UTC 2015
363

Could anyone throw some light on this situation?

We use ext4 on our brokers and these settings:

port=9092
num.network.threads=2
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/kafka/kafka-data
num.partitions=1
default.replication.factor=2
message.max.bytes=10000000
replica.fetch.max.bytes=10000000
auto.create.topics.enable=false
log.roll.hours=24
num.replica.fetchers=4
auto.leader.rebalance.enable=true
log.retention.hours=168
log.segment.bytes=134217728
log.retention.check.interval.ms=60000
log.cleaner.enable=false
delete.topic.enable=true
zookeeper.connect=zk1d.gce-eu.kafka:2181,zk2d.gce-eu.kafka:2181,zk3d.gce-eu.kafka:2181/kafka
zookeeper.connection.timeout.ms=6000

Should I change any of these parameters, or should a cluster with 3 brokers
and replication factor 2 already prevent such issues?

log.flush.interval.ms
log.flush.interval.messages
log.flush.scheduler.interval.ms
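For reference, if we were to set these explicitly it would look like the
following (values are illustrative only, not what we run; by default Kafka
leaves flushing to the OS page cache and relies on replication for
durability):

# Illustrative values only -- not a recommendation.
# Force an fsync after this many messages on a partition:
log.flush.interval.messages=10000
# Force an fsync if this many ms have passed since the last flush:
log.flush.interval.ms=1000
# How often the background flusher checks whether a flush is due:
log.flush.scheduler.interval.ms=1000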

THX!


-- 
Best regards,
Gleb Zhukov
