kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joel Koshy <jjkosh...@gmail.com>
Subject Re: Issue with kafka cluster with broker not able to send messages.
Date Fri, 11 Apr 2014 00:20:58 GMT
When you see this happening (on broker 4 in this instance), can you
confirm the Kafka process handle limit?

cat /proc/<pid>/limits

On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> *Issue : *Kafka cluster goes to an un-responsive state after some time with
> producers getting Socket time-outs on every request made.
> 
> *Kafka Version* - 0.8.1
> 
> *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> 
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16382
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 10240
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> *Broker Properties :*
> 
> 3 topics, 5 partitions, 5  i/o , network threads.
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> zookeeper.connection.timeout.ms=1000000
> No replication amongst brokers.
> 
> *Producer Properties * -
> sync producer,
>  required.ack = -1 ( Waits for ack from broker, most durable).
> kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> (requested 102400).
> 
> *NO Consumers*
> 
> *QPS From producer to broker* ..
> From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
> Message size is about 600 bytes
> 
> From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> message size is about 1 kb.
> 
> *Exception seen on producer after 9-10 hours of seamless message send.*
> 
>  Exception
> kafka.common.FailedToSendMessageException: Failed to send messages after 3
> tries.
> 
> ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
> for topics user with correlation ids in [25617,25932]
> 
>  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> kafka.producer.async.DefaultEventHandler - Failed to send producer request
> with correlation id 25931 to broker 4 with data for partitions
> [userservice,2]
> java.net.SocketTimeoutException
>         at
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>         at
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
>         at kafka.utils.Utils$.read(Utils.scala:375)
> 
> 
> *No of Open FDs in one of the broker machines*- 999
> 
> *Netstat of kafka broker :*
> 
> Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> 
> Sample :
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> CLOSE_WAIT
> tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> CLOSE_WAIT
> 
> *On producer boxes :*
> 
> Lot of FIN_WAIT 1 ,2 with high send queues.
> tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> FIN_WAIT1
> 
> tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> 
> tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> FIN_WAIT1
> 
> 
> *Broker Logs : *
> In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> connection to /10.65.40.137.
> 
> EXCEPT in 1 machine :
> 
> INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> kafka.network.Processor - Closing socket connection to /10.65.40.137.
> 
> No logs in-between. First fatal exception at 07.11.
> 
> FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> to highwatermark file:
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
> files)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
>         at java.io.FileWriter.<init>(FileWriter.java:73)
>         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
>         at
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
>         at
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
>         at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
> Error in acceptor
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> 'kafka-recovery-point-checkpoint'
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> open files)
> 
> 
> *No EXCEPTIONS in any other logs in any other broker machine..*
> 
> Attached is the thread dump from leader.



Mime
View raw message