kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guozhang Wang <wangg...@gmail.com>
Subject Re: Issue with kafka cluster with broker not able to send messages.
Date Thu, 10 Apr 2014 06:08:54 GMT
Hello Arya,

The broker seems dead due to too many open file handlers, which are likely
due to too many open sockets. Hhow many producer clientss do you have on
these 5 machines, and could you check if there is any socket leak?

Guozhang


On Wed, Apr 9, 2014 at 8:50 PM, Arya Ketan <ketan.arya@gmail.com> wrote:

> *Issue : *Kafka cluster goes to an un-responsive state after some time
> with producers getting Socket time-outs on every request made.
>
> *Kafka Version* - 0.8.1
>
> *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
>
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16382
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 10240
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> *Broker Properties :*
>
> 3 topics, 5 partitions, 5  i/o , network threads.
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> zookeeper.connection.timeout.ms=1000000
> No replication amongst brokers.
>
> *Producer Properties * -
> sync producer,
>  required.ack = -1 ( Waits for ack from broker, most durable).
> kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> (requested 102400).
>
> *NO Consumers*
>
> *QPS From producer to broker* ..
> From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
> Message size is about 600 bytes
>
> From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> message size is about 1 kb.
>
> *Exception seen on producer after 9-10 hours of seamless message send.*
>
>  Exception
> kafka.common.FailedToSendMessageException: Failed to send messages after 3
> tries.
>
> ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
> for topics user with correlation ids in [25617,25932]
>
>  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> kafka.producer.async.DefaultEventHandler - Failed to send producer request
> with correlation id 25931 to broker 4 with data for partitions
> [userservice,2]
> java.net.SocketTimeoutException
>         at
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>         at
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
>         at kafka.utils.Utils$.read(Utils.scala:375)
>
>
> *No of Open FDs in one of the broker machines*- 999
>
> *Netstat of kafka broker :*
>
> Lot  of sockets in CLOSE_WAIT with high receive queue nos.
>
> Sample :
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> CLOSE_WAIT
> tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> CLOSE_WAIT
>
> *On producer boxes :*
>
> Lot of FIN_WAIT 1 ,2 with high send queues.
> tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> FIN_WAIT1
>
> tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
>
> tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> FIN_WAIT1
>
>
> *Broker Logs : *
> In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> connection to /10.65.40.137.
>
> EXCEPT in 1 machine :
>
> INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> kafka.network.Processor - Closing socket connection to /10.65.40.137.
>
> No logs in-between. First fatal exception at 07.11.
>
> FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> to highwatermark file:
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
> files)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
>         at java.io.FileWriter.<init>(FileWriter.java:73)
>         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
>         at
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
>         at
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
>         at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor
> - Error in acceptor
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> 'kafka-recovery-point-checkpoint'
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> open files)
>
>
> *No EXCEPTIONS in any other logs in any other broker machine..*
>
> Attached is the thread dump from leader.
>



-- 
-- Guozhang

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message