kafka-users mailing list archives

From Arya Ketan <ketan.a...@gmail.com>
Subject Re: Issue with kafka cluster with broker not able to send messages.
Date Fri, 11 Apr 2014 02:43:34 GMT
By the way, we may have found the issue ..


On going through the thread dump, we found that 4-5 threads were blocked on
log4j's callAppenders and 2-3 threads were in the IN_NATIVE state while
trying to write logs to disk. The network threads were therefore blocked on
the log4j threads, hanging the kafka broker.

Also, our log4j file has a console appender, which writes to STDOUT and
STDERR. We suspected that the threads were getting blocked while writing to
stdout and stderr, so we removed the console appender and started writing to
a file instead. It's been stable for quite some time now. (Fingers crossed).
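
For reference, the change amounts to something like this log4j sketch (the appender names, paths, and sizes here are assumptions for illustration, not our exact config):

```properties
# Hypothetical log4j.properties sketch: swap the console appender for a
# file appender so broker threads never block writing to stdout/stderr.
log4j.rootLogger=INFO, kafkaFile

# Before (suspected culprit): ConsoleAppender writing to STDOUT
# log4j.appender.stdout=org.apache.log4j.ConsoleAppender
# log4j.appender.stdout.layout=org.apache.log4j.PatternLayout

# After: rolling file appender (path and sizes are assumptions)
log4j.appender.kafkaFile=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaFile.File=/var/log/kafka/server.log
log4j.appender.kafkaFile.MaxFileSize=100MB
log4j.appender.kafkaFile.MaxBackupIndex=10
log4j.appender.kafkaFile.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaFile.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```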


*Process limits :*
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1031009              1031009              processes
Max open files            1024                 1024                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1031009              1031009              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
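
A quick way to compare a process's current open-FD count against the soft limit above (the PID here is a placeholder; substitute the broker's pid):

```shell
# Compare a process's open file descriptors to its "Max open files" soft
# limit. PID=$$ (this shell) is only a placeholder for the example.
PID=$$
open=$(ls /proc/$PID/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/$PID/limits)
echo "open=$open limit=$limit"
```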




Arya


On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <jjkoshy.w@gmail.com> wrote:

> When you see this happening (on broker 4 in this instance), can you
> confirm the Kafka process handle limit?
>
> cat /proc/<pid>/limits
>
> On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > *Issue :* Kafka cluster goes to an unresponsive state after some time,
> > with producers getting socket timeouts on every request made.
> >
> > *Kafka Version* - 0.8.1
> >
> > *Machines* : VMs, 2 cores, 8 GB RAM, Linux, 3-node cluster.
> >
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker Properties :*
> >
> > 3 topics, 5 partitions, 5 I/O and network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer Properties* -
> > sync producer,
> > request.required.acks = -1 (waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
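> >
> > For context, the sync-producer settings above would correspond to a 0.8
> > producer config roughly like this sketch (the broker list and serializer
> > are assumptions, not the actual config):

```properties
# Hypothetical Kafka 0.8 producer config matching the settings described
# above. metadata.broker.list and serializer.class are assumptions.
metadata.broker.list=cp-analytics-kafka:9092
producer.type=sync
request.required.acks=-1
# request.timeout.ms lines up with the SO_TIMEOUT=10000 seen in the log
request.timeout.ms=10000
send.buffer.bytes=102400
serializer.class=kafka.serializer.StringEncoder
```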
> >
> > *NO Consumers*
> >
> > *QPS from producer to broker* ..
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages is sent. Each
> > message is about 600 bytes.
> >
> > From 1 box, every 2-3 seconds, a batch of 400 messages is sent. Each
> > message is about 1 kb.
> >
> > *Exception seen on producer after 9-10 hours of seamless message sending.*
> >
> >  Exception
> > kafka.common.FailedToSendMessageException: Failed to send messages after
> 3
> > tries.
> >
> > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> requests
> > for topics user with correlation ids in [25617,25932]
> >
> >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer
> request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >         at
> > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >         at
> >
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >         at kafka.utils.Utils$.read(Utils.scala:375)
> >
> >
> > *No. of open FDs on one of the broker machines* - 999
> >
> > *Netstat of kafka broker :*
> >
> > Lots of sockets in CLOSE_WAIT with high receive-queue numbers.
> >
> > Sample :
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > CLOSE_WAIT
> > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > CLOSE_WAIT
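
A one-liner to summarize socket states like the samples above:

```shell
# Summarize TCP socket states on the box. /proc/net/tcp column 4 holds the
# state in hex: 01=ESTABLISHED, 04=FIN_WAIT1, 05=FIN_WAIT2, 06=TIME_WAIT,
# 08=CLOSE_WAIT.
awk 'NR>1 {print $4}' /proc/net/tcp | sort | uniq -c | sort -rn

# Equivalent with netstat, filtered to the broker port (assumed 9092):
#   netstat -tan | grep ':9092' | awk '{print $6}' | sort | uniq -c
```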
> >
> > *On producer boxes :*
> >
> > Lots of sockets in FIN_WAIT1/FIN_WAIT2 with high send queues.
> > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> >
> > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> >
> > *Broker Logs : *
> > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > connection to /10.65.40.137.
> >
> > EXCEPT on 1 machine:
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in between. First fatal exception at 07:11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> open
> > files)
> >         at java.io.FileOutputStream.open(Native Method)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >         at java.io.FileWriter.<init>(FileWriter.java:73)
> >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >         at
> >
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >         at
> kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >         at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >         at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][]
> kafka.network.Acceptor -
> > Error in acceptor
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at
> >
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > open files)
> >
> >
> > *No EXCEPTIONS in any other logs in any other broker machine..*
> >
> > Attached is the thread dump from the leader.
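
The "Too many open files" errors line up with the 1024 "Max open files" limit reported earlier. A common mitigation (a sketch, with assumed values) is to raise the soft limit before starting the broker:

```shell
# Sketch: raise the open-files soft limit up to the hard limit before
# starting the broker. Kafka needs headroom for log segment files plus
# one FD per client socket.
hard=$(ulimit -Hn)
ulimit -n "$hard"
echo "soft nofile limit now: $(ulimit -n)"

# For a persistent change, /etc/security/limits.conf (user is an assumption):
#   kafka  soft  nofile  100000
#   kafka  hard  nofile  100000
```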
>
>
>
