kafka-users mailing list archives

From Tousif <tousif.pa...@gmail.com>
Subject Re: kafka brokers going down within 24 hrs
Date Wed, 21 Jan 2015 06:05:08 GMT
Any help?

On Mon, Jan 19, 2015 at 11:43 AM, Tousif <tousif.pasha@gmail.com> wrote:

> Here are the logs from broker ids 0 and 1, captured when broker 1 went
> down.
>
> http://paste.ubuntu.com/9782553/
> http://paste.ubuntu.com/9782554/
>
>
> I'm using Netty in Storm, and here are the configs:
> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>
> storm.messaging.netty.buffer_size: 209715200
> storm.messaging.netty.max_retries: 10
> storm.messaging.netty.max_wait_ms: 5000
> storm.messaging.netty.min_wait_ms: 10000
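A minimal Python sketch for sanity-checking the Netty values above. It assumes `min_wait_ms` is meant as the lower bound and `max_wait_ms` as the upper bound of the reconnect backoff, so a minimum larger than the maximum is worth a second look. `parse_flat_config` and `check_netty` are hypothetical helpers that only understand flat `key: value` lines; this is not a YAML parser and not part of Storm.

```python
# Sketch: sanity-check the storm.messaging.netty.* settings quoted above.
# Assumption: min_wait_ms is the lower and max_wait_ms the upper bound of
# the reconnect backoff, so min > max is treated as suspicious.

CONFIG_TEXT = """\
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.buffer_size: 209715200
storm.messaging.netty.max_retries: 10
storm.messaging.netty.max_wait_ms: 5000
storm.messaging.netty.min_wait_ms: 10000
"""

PREFIX = "storm.messaging.netty."


def parse_flat_config(text):
    """Parse flat 'key: value' lines (hypothetical helper, not a YAML parser)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        # Keep integers as int, everything else as a plain string.
        settings[key.strip()] = int(value) if value.isdigit() else value
    return settings


def check_netty(settings):
    """Return a small report about the Netty transport settings."""
    min_wait = settings[PREFIX + "min_wait_ms"]
    max_wait = settings[PREFIX + "max_wait_ms"]
    return {
        "backoff_inverted": min_wait > max_wait,
        "buffer_mib": settings[PREFIX + "buffer_size"] / (1024 * 1024),
    }


if __name__ == "__main__":
    report = check_netty(parse_flat_config(CONFIG_TEXT))
    print(report)  # {'backoff_inverted': True, 'buffer_mib': 200.0}
```

With the values in this message the check flags the backoff bounds as inverted (10000 ms minimum vs 5000 ms maximum) and shows that 209715200 bytes is 200 MiB.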
>
>
>
>
>
>
> On Sat, Jan 17, 2015 at 1:24 AM, Harsha <kafka@harsha.io> wrote:
>
>> Tousif,
>>         I meant to say that if the kafka broker is going down often, it's
>>         better to analyze the root cause of the crash. Using supervisord
>>         to monitor the kafka broker is fine; sorry about the confusion.
>> -Harsha
>> On Fri, Jan 16, 2015, at 11:25 AM, Gwen Shapira wrote:
>> > Those errors are expected - if broker 10.0.0.11 went down, it will
>> > reset the connection and the other broker will close the socket.
>> > However, it looks like 10.0.0.11 crashes every two minutes?
>> >
>> > Do you have the logs from 10.0.0.11?
>> >
>> > On Thu, Jan 15, 2015 at 9:51 PM, Tousif <tousif.pasha@gmail.com> wrote:
>> > > I'm using kafka_2.9.2-0.8.1.1 (Kafka 0.8.1.1 built for Scala 2.9.2) and
>> > > zookeeper 3.4.6. I noticed that only one broker is going down.
>> > > My message size is less than 3 KB, KAFKA_HEAP_OPTS="-Xmx512M",
>> > > and KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseCompressedOops
>> > > -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
>> > > -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC
>> > > -Djava.awt.headless=true".
>> > >
>> > > Do you mean the kafka broker never goes down? And does the broker
>> > > restart automatically after failing?
>> > > I see only these errors on both brokers.
>> > >
>> > > 10.0.0.11 is the broker which is going down.
>> > >
>> > > ERROR Closing socket for /10.0.0.11 because of error (kafka.network.Processor)
>> > > java.io.IOException: Connection reset by peer
>> > > at sun.nio.ch.FileDispatcher.read0(Native Method)
>> > > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> > > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> > > at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>> > > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> > > at kafka.utils.Utils$.read(Utils.scala:375)
>> > > at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> > > at kafka.network.Processor.read(SocketServer.scala:347)
>> > > at kafka.network.Processor.run(SocketServer.scala:245)
>> > > at java.lang.Thread.run(Thread.java:662)
>> > > [2015-01-16 11:01:48,173] INFO Closing socket connection to /10.0.0.11. (kafka.network.Processor)
>> > > [2015-01-16 11:03:08,164] ERROR Closing socket for /10.0.0.11 because of error (kafka.network.Processor)
>> > > java.io.IOException: Connection reset by peer
>> > > at sun.nio.ch.FileDispatcher.read0(Native Method)
>> > > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> > > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> > > at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>> > > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> > > at kafka.utils.Utils$.read(Utils.scala:375)
>> > > at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> > > at kafka.network.Processor.read(SocketServer.scala:347)
>> > > at kafka.network.Processor.run(SocketServer.scala:245)
>> > > at java.lang.Thread.run(Thread.java:662)
>> > > [2015-01-16 11:03:08,280] INFO Closing socket connection to /10.0.0.11. (kafka.network.Processor)
>> > > [2015-01-16 11:03:48,369] ERROR Closing socket for /10.0.0.11 because of error (kafka.network.Processor)
>> > > java.io.IOException: Connection reset by peer
>> > > at sun.nio.ch.FileDispatcher.read0(Native Method)
>> > > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> > > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> > > at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>> > > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> > > at kafka.utils.Utils$.read(Utils.scala:375)
>> > > at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> > > at kafka.network.Processor.read(SocketServer.scala:347)
>> > > at kafka.network.Processor.run(SocketServer.scala:245)
>> > > at java.lang.Thread.run(Thread.java:662)
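Gwen's question is whether 10.0.0.11 really drops every couple of minutes. A minimal Python sketch for measuring that from a broker log: it assumes the `[2015-01-16 11:01:48,173]` timestamp prefix seen in these lines, and it treats lines less than a hypothetical 5-second window apart as one disconnect (so an ERROR and its matching INFO count once). Fed the four timestamped lines above, it reports gaps of roughly 80 s and 40 s.

```python
from datetime import datetime

# Timestamped lines copied from the excerpt above (stack traces omitted).
LOG_LINES = [
    "[2015-01-16 11:01:48,173] INFO Closing socket connection to /10.0.0.11. (kafka.network.Processor)",
    "[2015-01-16 11:03:08,164] ERROR Closing socket for /10.0.0.11 because of error (kafka.network.Processor)",
    "[2015-01-16 11:03:08,280] INFO Closing socket connection to /10.0.0.11. (kafka.network.Processor)",
    "[2015-01-16 11:03:48,369] ERROR Closing socket for /10.0.0.11 because of error (kafka.network.Processor)",
]

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # "173" after the comma parses as 173 ms


def timestamps(lines, host="/10.0.0.11"):
    """Extract timestamps of 'Closing socket' lines that mention the given host."""
    result = []
    for line in lines:
        if line.startswith("[") and "Closing socket" in line and host in line:
            result.append(datetime.strptime(line[1:line.index("]")], TS_FORMAT))
    return result


def disconnects(times, merge_within=5.0):
    """Collapse lines less than merge_within seconds apart into one event."""
    events, last = [], None
    for t in times:
        if last is None or (t - last).total_seconds() > merge_within:
            events.append(t)
        last = t
    return events


def gaps_seconds(events):
    """Seconds between consecutive disconnect events."""
    return [(b - a).total_seconds() for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    print(gaps_seconds(disconnects(timestamps(LOG_LINES))))
```

Pointing `timestamps` at a full day of `server.log` lines instead of this short excerpt would show whether the spacing is regular or bursty.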
>> > >
>> > >
>> > >
>> > > On Thu, Jan 15, 2015 at 7:49 PM, Harsha <kafka@harsha.io> wrote:
>> > >
>> > >> Tousif,
>> > >>        Which versions of kafka and zookeeper are you using, what's your
>> > >>        message size, and what JVM heap size did you allocate for the
>> > >>        kafka brokers?
>> > >> There is only 1 zookeeper node; if it's a production cluster, I
>> > >> recommend running a quorum of zookeeper nodes. Both kafka and storm
>> > >> are heavy users of zookeeper. Also, supervisord is recommended for
>> > >> storm, but I am not sure you need it for kafka; for storm, it's the
>> > >> fail-fast nature of the workers that requires supervisord to restart
>> > >> them.
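Harsha's quorum recommendation comes down to majority arithmetic: a ZooKeeper ensemble keeps serving requests only while a majority of its servers is up, so a single node tolerates no failures at all. A tiny Python sketch of that arithmetic (the function names are illustrative, not from any ZooKeeper API):

```python
def majority(ensemble_size: int) -> int:
    """Servers that must be up and agreeing for the ensemble to serve requests."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} node(s): majority {majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is also why ensembles are usually sized with an odd number of servers: going from 3 to 4 raises the majority from 2 to 3 without tolerating an extra failure.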
>> > >> When kafka goes down the first time, i.e. before supervisord restarts
>> > >> it, do you see the same OOM error? Check the logs to see why it's
>> > >> going down the first time.
>> > >> -Harsha
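Following Harsha's advice to look at the first failure rather than the later restarts, a minimal Python sketch that returns the earliest suspicious line in a broker log. The match patterns, the rule that ignores the `Closing socket for ... because of error` lines (which Gwen describes as expected when a peer goes away), and the default `logs/server.log` path are all assumptions to adjust for your setup.

```python
import os
import re
import sys
from typing import Iterable, Optional

# Assumed patterns: anything at ERROR/FATAL level, plus JVM heap failures.
SUSPECT = re.compile(
    r"\b(?:ERROR|FATAL)\b|OutOfMemoryError|Could not reserve enough space"
)
# Assumed benign noise: the socket-reset errors a peer's disconnect produces.
IGNORE = re.compile(r"Closing socket for \S+ because of error")


def first_suspect_line(lines: Iterable[str]) -> Optional[str]:
    """Return the first line matching SUSPECT but not IGNORE, or None."""
    for line in lines:
        if SUSPECT.search(line) and not IGNORE.search(line):
            return line.rstrip("\n")
    return None


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "logs/server.log"  # hypothetical path
    if not os.path.isfile(path):
        print(f"no log file at {path}")
    else:
        with open(path, encoding="utf-8", errors="replace") as fh:
            print(first_suspect_line(fh))
```

If supervisord captures the broker's stdout/stderr in its own log files, running the same scan over those as well may catch JVM startup messages that never reach `server.log`.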
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Jan 14, 2015, at 10:50 PM, Tousif wrote:
>> > >> > Hello Chia-Chun Shih,
>> > >> >
>> > >> > There are multiple issues.
>> > >> > First, I don't see the out-of-memory error every time; the OOM
>> > >> > happens after supervisord keeps retrying to start kafka.
>> > >> > It goes down when it tries to add the partition fetchers.
>> > >> >
>> > >> > It starts with
>> > >> >
>> > >> > *conflict in /controller data:
>> > >> > {"version":1,"brokerid":0,"timestamp":"1421296052741"} stored data:
>> > >> > {"version":1,"brokerid":1,"timestamp":"1421291998088"}
>> > >> > (kafka.utils.ZkUtils$)*
>> > >> >
>> > >> >
>> > >> > ERROR Conditional update of path
>> > >> > /brokers/topics/realtimestreaming/partitions/1/state with data
>> > >> > {"controller_epoch":34,"leader":0,"version":1,"leader_epoch":54,"isr":[0]}
>> > >> > and expected version 90 failed due to
>> > >> > org.apache.zookeeper.KeeperException$BadVersionException:
>> > >> > KeeperErrorCode = BadVersion for
>> > >> > /brokers/topics/realtimestreaming/partitions/1/state
>> > >> > (kafka.utils.ZkUtils$)
>> > >> >
>> > >> > and then
>> > >> >
>> > >> > [ReplicaFetcherManager on broker 0] Removed fetcher for partitions
>> > >> > [realtimestreaming,0],[realtimestreaming,1]
>> > >> > (kafka.server.ReplicaFetcherManager)
>> > >> > [2015-01-15 09:57:34,350] INFO Truncating log realtimestreaming-0 to offset 846. (kafka.log.Log)
>> > >> > [2015-01-15 09:57:34,351] INFO Truncating log realtimestreaming-1 to offset 957. (kafka.log.Log)
>> > >> > [2015-01-15 09:57:34,650] INFO [ReplicaFetcherManager on broker 0] *Added
>> > >> > fetcher for partitions ArrayBuffer([[realtimestreaming,0], initOffset 846
>> > >> > to broker id:1,host:realtimeslave1.novalocal,port:9092] ,
>> > >> > [[realtimestreaming,1], initOffset 957 to broker
>> > >> > id:1,host:realtimeslave1.novalocal,port:9092] )
>> > >> > (kafka.server.ReplicaFetcherManager)*
>> > >> > [2015-01-15 09:57:34,654] INFO [ReplicaFetcherThread-0-1], Starting
>> > >> >  (kafka.server.ReplicaFetcherThread)
>> > >> > [2015-01-15 09:57:34,747] INFO [ReplicaFetcherThread-1-1], Starting
>> > >> >  (kafka.server.ReplicaFetcherThread)
>> > >> > [2015-01-15 09:58:14,156] INFO Closing socket connection to /10.0.0.11. (kafka.network.Processor)
>> > >> >
>> > >> >
>> > >> >
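The `BadVersionException` in Tousif's log comes from a conditional write: ZooKeeper applies `setData` only if the caller's expected version matches the znode's current version (an expected version of -1 means "don't check"). Broker 0 expected version 90, so the partition-state znode had presumably been rewritten by another writer after broker 0 last read it; together with the `/controller` conflict, that hints at the two brokers disagreeing about the current state, though these logs alone don't prove it. A toy in-memory model of the mechanism, which is not the real ZooKeeper client (`ToyZnode` and `BadVersionError` are hypothetical names):

```python
class BadVersionError(Exception):
    """Stand-in for ZooKeeper's BadVersion error (toy model only)."""


class ToyZnode:
    """In-memory imitation of a znode's versioned, conditional setData."""

    def __init__(self, data: str, version: int = 0):
        self.data = data
        self.version = version

    def set_data(self, data: str, expected_version: int) -> int:
        # -1 means "don't check", as in ZooKeeper's setData.
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(
                f"expected version {expected_version}, znode is at {self.version}"
            )
        self.data = data
        self.version += 1
        return self.version


if __name__ == "__main__":
    state = ToyZnode('{"leader":1}', version=90)
    seen_by_a = seen_by_b = state.version  # both writers read version 90
    print(state.set_data('{"leader":1,"isr":[1]}', seen_by_a))  # 91
    try:
        state.set_data('{"leader":0,"isr":[0]}', seen_by_b)  # stale version
    except BadVersionError as err:
        print("rejected:", err)
```

The second writer's update is rejected rather than silently overwriting the first, which is the point of the versioned write.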
>> > >> > On Thu, Jan 15, 2015 at 12:01 PM, Chia-Chun Shih
>> > >> > <chiachun.shih@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > You can use tools (e.g., VisualVM) to diagnose the OOM problem.
>> > >> > >
>> > >> > > 2015-01-15 14:15 GMT+08:00 Tousif Khazi <tousif@senseforth.com>:
>> > >> > >
>> > >> > > > I see this error:
>> > >> > > >
>> > >> > > >  ERROR [ReplicaFetcherThread-0-1], Error for partition
>> > >> > > > [realtimestreaming,1] to broker 1:class
>> > >> > > > kafka.common.NotLeaderForPartitionException
>> > >> > > > (kafka.server.ReplicaFetcherThread)
>> > >> > > > [2015-01-15 10:00:04,348] INFO [ReplicaFetcherManager on broker 0]
>> > >> > > > Removed fetcher for partitions [realtimestreaming,1]
>> > >> > > > (kafka.server.ReplicaFetcherManager)
>> > >> > > > [2015-01-15 10:00:04,355] INFO Closing socket connection to
>> > >> > > > /10.0.0.11. (kafka.network.Processor)
>> > >> > > > [2015-01-15 10:00:04,444] WARN [KafkaApi-0] Fetch request with
>> > >> > > > correlation id 0 from client ReplicaFetcherThread-0-0 on partition
>> > >> > > > [realtimestreaming,1] failed due to Leader not local for partition
>> > >> > > > [realtimestreaming,1] on broker 0 (kafka.server.KafkaApis)
>> > >> > > > [2015-01-15 10:00:04,545] INFO [ReplicaFetcherThread-0-1], Shutting
>> > >> > > > down (kafka.server.ReplicaFetcherThread)
>> > >> > > > [2015-01-15 10:00:04,848] INFO [ReplicaFetcherThread-0-1], Stopped
>> > >> > > > (kafka.server.ReplicaFetcherThread)
>> > >> > > > [2015-01-15 10:00:04,849] INFO [ReplicaFetcherThread-0-1], Shutdown
>> > >> > > > completed (kafka.server.ReplicaFetcherThread)
>> > >> > > > [2015-01-15 10:00:39,256] ERROR Closing socket for /10.0.0.11 because
>> > >> > > > of error (kafka.network.Processor)
>> > >> > > > java.io.IOException: Connection reset by peer
>> > >> > > > at sun.nio.ch.FileDispatcher.read0(Native Method)
>> > >> > > > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> > >> > > > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> > >> > > > at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>> > >> > > > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> > >> > > >
>> > >> > > > On Wed, Jan 14, 2015 at 10:12 PM, Tousif <tousif.pasha@gmail.com> wrote:
>> > >> > > > > Thanks, Harsha, for the quick response.
>> > >> > > > > I don't see any other errors. I used to see a replica fetcher error,
>> > >> > > > > but it seems to have disappeared after setting the number of replica
>> > >> > > > > fetcher threads to 2, as I have 2 partitions. Sometimes I see
>> > >> > > > > zookeeper session expiration.
>> > >> > > > > On Jan 14, 2015 8:02 PM, "Harsha" <kafka@harsha.io> wrote:
>> > >> > > > >
>> > >> > > > >> Tousif,
>> > >> > > > >>        Do you see any other errors in server.log
>> > >> > > > >> -Harsha
>> > >> > > > >>
>> > >> > > > >> On Wed, Jan 14, 2015, at 01:51 AM, Tousif wrote:
>> > >> > > > >> > Hello,
>> > >> > > > >> >
>> > >> > > > >> > I have configured the kafka nodes to run via supervisord and see the
>> > >> > > > >> > following exceptions, and eventually the brokers run out of memory.
>> > >> > > > >> > I have given them enough memory, and they process 1 event/second,
>> > >> > > > >> > yet kafka goes down every day.
>> > >> > > > >> >
>> > >> > > > >> > I'm wondering what configuration is missing or needs to be added.
>> > >> > > > >> >
>> > >> > > > >> > Here are my cluster details:
>> > >> > > > >> >  2 brokers
>> > >> > > > >> >  1 zookeeper
>> > >> > > > >> > and a 2-node apache storm cluster
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> > INFO zookeeper state changed (SyncConnected)
>> > >> > > > >> > (org.I0Itec.zkclient.ZkClient)
>> > >> > > > >> > ERROR Closing socket for /10.0.0.11 because of error
>> > >> > > > >> > (kafka.network.Processor)
>> > >> > > > >> > java.io.IOException: Connection reset by peer
>> > >> > > > >> > at sun.nio.ch.FileDispatcher.read0(Native Method)
>> > >> > > > >> > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> > >> > > > >> > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> > >> > > > >> > at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>> > >> > > > >> > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> > >> > > > >> > at kafka.utils.Utils$.read(Utils.scala:375)
>> > >> > > > >> > at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>> > >> > > > >> > at kafka.network.Processor.read(SocketServer.scala:347)
>> > >> > > > >> > at kafka.network.Processor.run(SocketServer.scala:245)
>> > >> > > > >> > at java.lang.Thread.run(Thread.java:662)
>> > >> > > > >> > [2015-01-13 23:43:37,962] INFO Closing socket connection to /10.0.0.11.
>> > >> > > > >> > (kafka.network.Processor)
>> > >> > > > >> > Error occurred during initialization of VM
>> > >> > > > >> > Could not reserve enough space for object heap
>> > >> > > > >> > Error occurred during initialization of VM
>> > >> > > > >> > Could not reserve enough space for object heap
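"Could not reserve enough space for object heap" is printed by the JVM during startup, before any broker code runs, when it cannot reserve the heap it was asked for; it is not an `OutOfMemoryError` inside a running broker. A minimal, Unix-only Python sketch that gathers the numbers worth comparing: the `-Xmx` from `KAFKA_HEAP_OPTS` (falling back to the `-Xmx512M` mentioned in this thread), `MemAvailable` from Linux's `/proc/meminfo`, and the address-space limit that `ulimit -v` sets. Treating the last `-Xmx` as the effective one is an assumption, and a 32-bit JVM can produce the same message for reasons this script cannot see.

```python
import os
import re
import resource  # Unix-only
from typing import Optional

_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_xmx(opts: str) -> Optional[int]:
    """Return the -Xmx value in bytes, assuming the last -Xmx given wins."""
    matches = re.findall(r"-Xmx(\d+)([kKmMgGtT]?)\b", opts)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]


def mem_available_bytes(meminfo_text: str) -> Optional[int]:
    """Read MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    return None


def address_space_limit() -> Optional[int]:
    """Soft RLIMIT_AS in bytes (what `ulimit -v` sets), or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    heap = parse_xmx(os.environ.get("KAFKA_HEAP_OPTS", "-Xmx512M"))
    try:
        with open("/proc/meminfo") as fh:
            available = mem_available_bytes(fh.read())
    except OSError:
        available = None
    print(f"-Xmx: {heap} bytes")
    print(f"MemAvailable: {available} bytes")
    print(f"RLIMIT_AS: {address_space_limit()} bytes (None = unlimited)")
```

Running it on the broker host right after a failed restart, while supervisord is still retrying, is when the numbers are most telling, since leftover processes may still be holding memory.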
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> > --
>> > >> > > > >> > Regards,
>> > >> > > > >> > Tousif
>> > >> > > > >> > +918050227279
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> > --
>> > >> > > > >> >
>> > >> > > > >> >
>> > >> > > > >> > Regards
>> > >> > > > >> > Tousif Khazi
>> > >> > > > >>
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> > >
>> > >
>> > >
>>
>
>
>
>
>


-- 


Regards
Tousif Khazi
