kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joel Koshy <jjkosh...@gmail.com>
Subject Re: Issue with kafka cluster with broker not able to send messages.
Date Fri, 11 Apr 2014 17:36:43 GMT
Interesting - can you share those thread dumps?

Also, your process file handle limit is extremely low (and different
from what seems to be the default for your user-id from your earlier
message). You should bump that up to a couple of ten thousand at
least.

Joel

On Fri, Apr 11, 2014 at 08:13:34AM +0530, Arya Ketan wrote:
> By the way, we may have found the issue ..
> 
> 
> On going through the thread dump, we found that 4-5 threads were blocked on
> log4j.CallAppenders and 2-3 threads were in IN_NATIVE state while trying to
> write logs to disk. The network threads were there-fore blocked on log4j
> threads, thus hanging the kafka broker.
> 
> Also in logging, in our log4j file, we have a console appender.. which
> writes to STDOUT and STDERR,  we suspected that while writing to stdout and
> stderr, the threads are getting blocked. So, we removed the console
> appender and started writing to a file. Its stable for quite some time. (
> Fingers crossed).
> 
> 
> *Process limits : *
> Max cpu time              unlimited            unlimited
> seconds
> Max file size             unlimited            unlimited
> bytes
> Max data size             unlimited            unlimited
> bytes
> Max stack size            10485760             unlimited
> bytes
> Max core file size        0                    unlimited
> bytes
> Max resident set          unlimited            unlimited
> bytes
> Max processes             1031009              1031009
> processes
> Max open files            1024                 1024
> files
> Max locked memory         65536                65536
> bytes
> Max address space         unlimited            unlimited
> bytes
> Max file locks            unlimited            unlimited
> locks
> Max pending signals       1031009              1031009
> signals
> Max msgqueue size         819200               819200
> bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us
> 
> 
> 
> 
> Arya
> 
> 
> On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <jjkoshy.w@gmail.com> wrote:
> 
> > When you see this happening (on broker 4 in this instance), can you
> > confirm the Kafka process handle limit?
> >
> > cat /proc/<pid>/limits
> >
> > On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > > *Issue : *Kafka cluster goes to an un-responsive state after some time
> > with
> > > producers getting Socket time-outs on every request made.
> > >
> > > *Kafka Version* - 0.8.1
> > >
> > > *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> > >
> > > ulimit -a
> > > core file size          (blocks, -c) 0
> > > data seg size           (kbytes, -d) unlimited
> > > scheduling priority             (-e) 0
> > > file size               (blocks, -f) unlimited
> > > pending signals                 (-i) 16382
> > > max locked memory       (kbytes, -l) 64
> > > max memory size         (kbytes, -m) unlimited
> > > open files                      (-n) 10240
> > > pipe size            (512 bytes, -p) 8
> > > POSIX message queues     (bytes, -q) 819200
> > > real-time priority              (-r) 0
> > > stack size              (kbytes, -s) 8192
> > > cpu time               (seconds, -t) unlimited
> > > max user processes              (-u) unlimited
> > > virtual memory          (kbytes, -v) unlimited
> > > file locks                      (-x) unlimited
> > >
> > > *Broker Properties :*
> > >
> > > 3 topics, 5 partitions, 5  i/o , network threads.
> > > socket.send.buffer.bytes=1048576
> > > socket.receive.buffer.bytes=1048576
> > > socket.request.max.bytes=104857600
> > > zookeeper.connection.timeout.ms=1000000
> > > No replication amongst brokers.
> > >
> > > *Producer Properties * -
> > > sync producer,
> > >  required.ack = -1 ( Waits for ack from broker, most durable).
> > > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > > (requested 102400).
> > >
> > > *NO Consumers*
> > >
> > > *QPS From producer to broker* ..
> > > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.
> >  Each
> > > Message size is about 600 bytes
> > >
> > > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> > > message size is about 1 kb.
> > >
> > > *Exception seen on producer after 9-10 hours of seamless message send.*
> > >
> > >  Exception
> > > kafka.common.FailedToSendMessageException: Failed to send messages after
> > 3
> > > tries.
> > >
> > > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> > requests
> > > for topics user with correlation ids in [25617,25932]
> > >
> > >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > > kafka.producer.async.DefaultEventHandler - Failed to send producer
> > request
> > > with correlation id 25931 to broker 4 with data for partitions
> > > [userservice,2]
> > > java.net.SocketTimeoutException
> > >         at
> > > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> > >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> > >         at
> > >
> > java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> > >         at kafka.utils.Utils$.read(Utils.scala:375)
> > >
> > >
> > > *No of Open FDs in one of the broker machines*- 999
> > >
> > > *Netstat of kafka broker :*
> > >
> > > Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> > >
> > > Sample :
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > > CLOSE_WAIT
> > > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > > CLOSE_WAIT
> > >
> > > *On producer boxes :*
> > >
> > > Lot of FIN_WAIT 1 ,2 with high send queues.
> > > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > > ESTABLISHED
> > > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > >
> > > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> > >
> > > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > > ESTABLISHED
> > > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > >
> > >
> > > *Broker Logs : *
> > > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > > connection to /10.65.40.137.
> > >
> > > EXCEPT in 1 machine :
> > >
> > > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> > >
> > > No logs in-between. First fatal exception at 07.11.
> > >
> > > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> > writing
> > > to highwatermark file:
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> > open
> > > files)
> > >         at java.io.FileOutputStream.open(Native Method)
> > >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> > >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> > >         at java.io.FileWriter.<init>(FileWriter.java:73)
> > >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> > >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> > >         at
> > >
> > kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> > >         at
> > kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> > >         at
> > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> > >         at
> > >
> > java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> > >         at
> > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> > >         at
> > >
> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >         at
> > >
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >         at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][]
> > kafka.network.Acceptor -
> > > Error in acceptor
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at
> > >
> > sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> > >         at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > > 'kafka-recovery-point-checkpoint'
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > > open files)
> > >
> > >
> > > *No EXCEPTIONS in any other logs in any other broker machine..*
> > >
> > > Attached is the thread dump from leader.
> >
> >
> >


Mime
View raw message