kafka-users mailing list archives

From Ahmy Yulrizka <a...@yulrizka.com>
Subject Possibly leaking socket on ReplicaFetcherThread
Date Tue, 21 Jan 2014 11:29:42 GMT
We are running a 3-node Kafka cluster that serves 4 partitions.
We have been seeing strange behavior during network outages.

This has happened twice in the last couple of days. The first time it
took down the whole cluster; this time only 2 of the 3 nodes survived.
One node became the leader of all partitions, and the other node remained
in the ISR of only 1 partition (out of 4).

My best guess is that when the network goes down, the broker cannot reach
the other brokers to replicate, and it keeps opening new sockets
without closing the old ones. But I'm not entirely sure about this.
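To illustrate the pattern I suspect (a hypothetical sketch, not Kafka's actual ReplicaFetcherThread code; the method names are made up): if a reconnect path opens a replacement SocketChannel on every retry without closing the broken one, each attempt strands one file descriptor, which lsof reports exactly as `sock ... can't identify protocol`.

```java
import java.io.IOException;
import java.nio.channels.SocketChannel;

public class ReconnectLeakSketch {

    // Leaky pattern (hypothetical): on a socket error, open a replacement
    // channel but never close the broken one. Each retry strands one fd.
    static SocketChannel leakyReconnect(SocketChannel broken) throws IOException {
        return SocketChannel.open(); // BUG: 'broken' is abandoned, its fd stays open
    }

    // Safe pattern: always release the old fd before opening a new channel.
    static SocketChannel safeReconnect(SocketChannel broken) throws IOException {
        if (broken != null) {
            broken.close(); // frees the fd even if the peer is unreachable
        }
        return SocketChannel.open();
    }

    public static void main(String[] args) throws IOException {
        SocketChannel ch = SocketChannel.open();
        for (int i = 0; i < 5; i++) {
            // with leakyReconnect() here, 5 extra fds would remain open
            ch = safeReconnect(ch);
        }
        ch.close();
        System.out.println("reconnected without leaking fds");
    }
}
```

If the broker does something like the leaky variant during an outage, the fd count would climb on every failed replication attempt until the process hits its limit.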

Is there any way to mitigate the problem? Or is there a configuration
option to stop this from happening again?
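As a stopgap (not a fix for the leak itself), I assume raising the broker's file-descriptor limit would at least buy time during an outage; the limit values below are hypothetical:

```shell
# Current fd limit in this shell (the kafka user's effective limit):
ulimit -n

# The running broker's effective limit can also be checked via, e.g.:
#   grep "open files" /proc/<broker-pid>/limits
#
# Hypothetical raise for the kafka user in /etc/security/limits.conf:
#   kafka  soft  nofile  32768
#   kafka  hard  nofile  32768
```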


The java/kafka process has too many open socket file descriptors.
Running `lsof -a -p 11818` yields thousands of lines like this:

...
java    11818 kafka 3059u  sock                0,7       0t0 615637305 can't identify protocol
java    11818 kafka 3060u  sock                0,7       0t0 615637306 can't identify protocol
java    11818 kafka 3061u  sock                0,7       0t0 615637307 can't identify protocol
java    11818 kafka 3062u  sock                0,7       0t0 615637308 can't identify protocol
java    11818 kafka 3063u  sock                0,7       0t0 615637309 can't identify protocol
java    11818 kafka 3064u  sock                0,7       0t0 615637310 can't identify protocol
java    11818 kafka 3065u  sock                0,7       0t0 615637311 can't identify protocol
...

I verified that the open sockets were still not closed when I repeated
the command after 2 minutes.
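To watch the leak grow without the full lsof dump, the descriptor count alone can be sampled from /proc (11818 is the broker pid from the lsof output above; substitute your own):

```shell
# Count the open fds of a process; run twice a few minutes apart.
# A steadily growing count confirms descriptors are never released.
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

count_fds $$    # demo against the current shell; use the kafka pid instead
```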


And the Kafka log on the broken node fills with errors like this:

[2014-01-21 04:21:48,819]  64573925 [kafka-acceptor] ERROR kafka.network.Acceptor  - Error in acceptor
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:165)
        at kafka.network.Acceptor.accept(SocketServer.scala:200)
        at kafka.network.Acceptor.run(SocketServer.scala:154)
        at java.lang.Thread.run(Thread.java:701)
[2014-01-21 04:21:48,819]  64573925 [kafka-acceptor] ERROR kafka.network.Acceptor  - Error in acceptor
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:165)
        at kafka.network.Acceptor.accept(SocketServer.scala:200)
        at kafka.network.Acceptor.run(SocketServer.scala:154)
        at java.lang.Thread.run(Thread.java:701)
[2014-01-21 04:21:48,811]  64573917 [ReplicaFetcherThread-0-1] INFO  kafka.consumer.SimpleConsumer  - Reconnect due to socket error: null
[2014-01-21 04:21:48,819]  64573925 [ReplicaFetcherThread-0-1] WARN  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-1], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 74930218; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [some-topic,0] -> PartitionFetchInfo(959825,1048576),[some-topic,3] -> PartitionFetchInfo(551546,1048576)
java.net.SocketException: Too many open files
        at sun.nio.ch.Net.socket0(Native Method)
        at sun.nio.ch.Net.socket(Net.java:156)
        at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:102)
        at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:55)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:122)
        at kafka.network.BlockingChannel.connect(BlockingChannel.scala:48)
        at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
        at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
        at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
        at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:110)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:109)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:108)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:94)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:86)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)


--
Ahmy Yulrizka
http://ahmy.yulrizka.com
@yulrizka
