kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arya Ketan <ketan.a...@gmail.com>
Subject Issue with kafka cluster with broker not able to send messages.
Date Thu, 10 Apr 2014 03:50:51 GMT
*Issue : *Kafka cluster goes to an un-responsive state after some time with
producers getting Socket time-outs on every request made.

*Kafka Version* - 0.8.1

*Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10240
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

*Broker Properties :*

3 topics, 5 partitions, 5  i/o , network threads.
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
zookeeper.connection.timeout.ms=1000000
No replication amongst brokers.

*Producer Properties * -
sync producer,
 required.ack = -1 ( Waits for ack from broker, most durable).
kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
(requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
(requested 102400).

*NO Consumers*

*QPS From producer to broker* ..
>From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
Message size is about 600 bytes

>From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
message size is about 1 kb.

*Exception seen on producer after 9-10 hours of seamless message send.*

 Exception
kafka.common.FailedToSendMessageException: Failed to send messages after 3
tries.

ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
for topics user with correlation ids in [25617,25932]

 WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
kafka.producer.async.DefaultEventHandler - Failed to send producer request
with correlation id 25931 to broker 4 with data for partitions
[userservice,2]
java.net.SocketTimeoutException
        at
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
        at
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
        at kafka.utils.Utils$.read(Utils.scala:375)


*No of Open FDs in one of the broker machines*- 999

*Netstat of kafka broker :*

Lot  of sockets in CLOSE_WAIT with high receive queue nos.

Sample :
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
CLOSE_WAIT
tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
CLOSE_WAIT

*On producer boxes :*

Lot of FIN_WAIT 1 ,2 with high send queues.
tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
ESTABLISHED
tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
FIN_WAIT1

tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED

tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
ESTABLISHED
tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
FIN_WAIT1


*Broker Logs : *
In server.log, mostly  INFO [2014-04-10 03:01:42,230]
[kafka-processor-9092-0][] kafka.network.Processor - Closing socket
connection to /10.65.40.137.

EXCEPT in 1 machine :

INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
kafka.network.Processor - Closing socket connection to /10.65.40.137.

No logs in-between. First fatal exception at 07.11.

FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
to highwatermark file:
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
        at java.io.FileWriter.<init>(FileWriter.java:73)
        at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
        at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
        at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
        at
kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
        at
kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
        at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
Error in acceptor
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
        at kafka.network.Acceptor.accept(SocketServer.scala:200)
        at kafka.network.Acceptor.run(SocketServer.scala:154)
        at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
'kafka-recovery-point-checkpoint'
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
open files)


*No EXCEPTIONS in any other logs in any other broker machine..*

Attached is the thread dump from leader.

Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message