kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arya Ketan <ketan.a...@gmail.com>
Subject Re: Issue with kafka cluster with broker not able to send messages.
Date Thu, 10 Apr 2014 07:52:29 GMT
Hi Guozhang,

I just have 1 producer client per producer machine.
The producer is in a singleton scope.

Is there a possibility to close producer sockets by force or use a producer
socket pool??


--

Arya

Arya


On Thu, Apr 10, 2014 at 11:38 AM, Guozhang Wang <wangguoz@gmail.com> wrote:

> Hello Arya,
>
> The broker seems dead due to too many open file handlers, which are likely
> due to too many open sockets. Hhow many producer clientss do you have on
> these 5 machines, and could you check if there is any socket leak?
>
> Guozhang
>
>
> On Wed, Apr 9, 2014 at 8:50 PM, Arya Ketan <ketan.arya@gmail.com> wrote:
>
> > *Issue : *Kafka cluster goes to an un-responsive state after some time
> > with producers getting Socket time-outs on every request made.
> >
> > *Kafka Version* - 0.8.1
> >
> > *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> >
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker Properties :*
> >
> > 3 topics, 5 partitions, 5  i/o , network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer Properties * -
> > sync producer,
> >  required.ack = -1 ( Waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
> >
> > *NO Consumers*
> >
> > *QPS From producer to broker* ..
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.
>  Each
> > Message size is about 600 bytes
> >
> > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> > message size is about 1 kb.
> >
> > *Exception seen on producer after 9-10 hours of seamless message send.*
> >
> >  Exception
> > kafka.common.FailedToSendMessageException: Failed to send messages after
> 3
> > tries.
> >
> > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> requests
> > for topics user with correlation ids in [25617,25932]
> >
> >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer
> request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >         at
> > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >         at
> >
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >         at kafka.utils.Utils$.read(Utils.scala:375)
> >
> >
> > *No of Open FDs in one of the broker machines*- 999
> >
> > *Netstat of kafka broker :*
> >
> > Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> >
> > Sample :
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > CLOSE_WAIT
> > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > CLOSE_WAIT
> >
> > *On producer boxes :*
> >
> > Lot of FIN_WAIT 1 ,2 with high send queues.
> > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> >
> > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> >
> > *Broker Logs : *
> > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > connection to /10.65.40.137.
> >
> > EXCEPT in 1 machine :
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in-between. First fatal exception at 07.11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> open
> > files)
> >         at java.io.FileOutputStream.open(Native Method)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >         at java.io.FileWriter.<init>(FileWriter.java:73)
> >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >         at
> >
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >         at
> kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >         at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >         at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor
> > - Error in acceptor
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at
> >
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > open files)
> >
> >
> > *No EXCEPTIONS in any other logs in any other broker machine..*
> >
> > Attached is the thread dump from leader.
> >
>
>
>
> --
> -- Guozhang
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message