storm-user mailing list archives

From: Jungtaek Lim <kabh...@gmail.com>
Subject: Re: Exclamation topology dying on remote cluster (Storm 0.10.2)
Date: Thu, 24 Aug 2017 07:35:34 GMT
Hello Zachary,

Could you provide the nimbus / supervisor logs? And could you try a
dedicated ZK for that cluster and see if it helps?
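
For example, something like this in storm.yaml would point the cluster at
its own ensemble (a sketch only; the hostnames below are placeholders for a
hypothetical dedicated ensemble, not real machines):

storm.zookeeper.servers:
- zk-dedicated01.example.com
- zk-dedicated02.example.com
- zk-dedicated03.example.com
storm.zookeeper.root: /storm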

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Aug 16, 2017 at 2:37 AM, Zachary Smith <tud20772@temple.edu> wrote:

> Upgraded to 1.1.0; it's still happening and I'm not sure why. In the
> supervisor logs there's a mention of the assignments path being deleted:
>
> 2017-08-15 11:26:41.688 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got WatchedEvent
> state:SyncConnected type:NodeDeleted path:/assignments/excla-1-1502809921
> for sessionid 0x15d96e85f7123d4
>
> However, the nimbus seems fine for the most part. All executors are alive,
> the Scheduler indicates all nodes are alive, and then the topology just
> stops. Does anyone have any tips for debugging this problem?
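>
> In case it helps, the deletion can also be observed from a standalone
> client, to pin down exactly when the znode disappears (a rough Curator
> sketch; the client port 2181 is an assumption, as is the one-shot watch
> being enough, and the path uses the /storm_0_10 root from our storm.yaml
> further down):
>
> import java.util.List;
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
> import org.apache.zookeeper.Watcher;
>
> public class AssignmentsWatch {
>     public static void main(String[] args) throws Exception {
>         // Same ensemble as in storm.yaml; 2181 is the default client port.
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>             "stg1examplezk-storm01.example.com:2181",
>             new ExponentialBackoffRetry(1000, 3));
>         client.start();
>         client.blockUntilConnected();
>         // List the current assignment znodes and leave a one-shot watch
>         // that reports the next change (NodeChildrenChanged/NodeDeleted).
>         List<String> current = client.getChildren()
>             .usingWatcher((Watcher) e ->
>                 System.out.println("event " + e.getType() + " on " + e.getPath()))
>             .forPath("/storm_0_10/assignments");
>         System.out.println("current assignments: " + current);
>         Thread.sleep(Long.MAX_VALUE); // stay alive so the watch can fire
>     }
> }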
>
> On Tue, Jul 18, 2017 at 2:45 PM, Zachary Smith <tud20772@temple.edu>
> wrote:
>
>> I can provide more logs if needed. The nimbus doesn't seem to log
>> anything about this at all; the last relevant entries in its logs are
>> the heartbeat cache and the assignments.
>>
>> On Mon, Jul 17, 2017 at 5:51 PM, Zachary Smith <Zack.Smith@temple.edu>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am experiencing a problem on a remote cluster that causes a running
>>> topology to be killed after around 15-20 minutes. We are deploying to a
>>> cluster with a nimbus and three worker nodes, with around 8GB RAM each.
>>> This cluster shares a zookeeper ensemble with another storm cluster
>>> (version 0.9.3) that runs topologies without issue; the 0.9.3 cluster
>>> uses a different zookeeper root.
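>>>
>>> Concretely, the separation is just the chroot: this cluster runs under
>>> storm.zookeeper.root: /storm_0_10 (see the storm.yaml below), while the
>>> 0.9.3 cluster sits under its own root (its exact path isn't shown here).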
>>>
>>> Nothing else is running on the cluster. There doesn't appear to be any
>>> error in the logs; the supervisor simply states that it's shutting down
>>> the worker, and then the topology is promptly shut down. The worker's
>>> state is set to :disallowed according to the supervisor. We have already
>>> tried setting
>>> nimbus.task.timeout.secs: 600
>>> nimbus.supervisor.timeout.secs: 600
>>> in storm.yaml, but that did not appear to help. I have enabled GC logging
>>> (see the sketch just below), but it doesn't appear to be logging
>>> anything. I would be very surprised if system load were causing this,
>>> since the topology does virtually nothing. Can anybody figure out what's
>>> going on here?
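>>>
>>> For reference, this is roughly how we enabled GC logging (a sketch; it
>>> assumes the supervisor substitutes %ID%, the worker's port, into
>>> worker.childopts, and the heap size shown is just the default):
>>>
>>> worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"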
>>>
>>> Logs will be posted below.
>>> *Our storm.yaml:*
>>>
>>> nimbus.host: stg1examplestorm2-nimbus01.example.com
>>> nimbus.thrift.port: 6627
>>> nimbus.thrift.max_buffer_size: 32000000
>>> storm.local.dir: /data/storm
>>>
>>> storm.zookeeper.servers:
>>> - stg1examplezk-storm01.example.com
>>> - stg1examplezk-storm02.example.com
>>> - stg1examplezk-storm03.example.com
>>>
>>>
>>> supervisor.slots.ports:
>>> - 6700
>>> - 6701
>>> - 6702
>>> - 6703
>>> - 6704
>>> - 6705
>>> - 6706
>>> - 6707
>>> - 6708
>>> - 6709
>>> - 6710
>>> - 6711
>>> - 6712
>>> - 6713
>>> - 6714
>>> - 6715
>>> - 6716
>>> - 6717
>>> - 6718
>>> - 6719
>>> - 6720
>>>
>>> ui.port: 6060
>>>
>>> storm.log.dir: /var/log/storm
>>>
>>> storm.zookeeper.root: /storm_0_10
>>>
>>>
>>>
>>> *Topology logs:*
>>>
>>> [a lot of this and zookeeper responses]
>>> 2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to
>>> channel [id: 0x6ee64433, /10.202.144.23:42166 =>
>>> stg1examplestorm2-node02.example.com/10.202.144.20:6720]
>>> 2017-07-17 17:03:00.574 b.s.m.n.Client [DEBUG] writing 1 messages to
>>> channel [id: 0x6ee64433, /10.202.144.23:42166 =>
>>> stg1examplestorm2-node02.example.com/10.202.144.20:6720]
>>> 2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to
>>> Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.575 b.s.m.n.Client [DEBUG] sent 1 messages to
>>> Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.613 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> notification sessionid:0x35d4744d5c10011
>>> 2017-07-17 17:03:00.614 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> WatchedEvent state:SyncConnected type:NodeDeleted
>>> path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10011
>>> 2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.615 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.637 b.s.d.worker [INFO] Shutting down worker
>>> excla-testcla-1-1500324928 13751591-813e-4721-ac3d-55084e32768f 6720
>>> 2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] closing Netty Client
>>> Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.638 b.s.m.n.Client [INFO] waiting up to 600000 ms to
>>> send 0 pending messages to
>>> Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to
>>> Netty-Client-stg1examplestorm2-node02.example.com/10.202.144.20:6720
>>> closed
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] closing Netty Client
>>> Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [INFO] waiting up to 600000 ms to
>>> send 0 pending messages to
>>> Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
>>> 2017-07-17 17:03:00.639 b.s.m.n.Client [DEBUG] channel to
>>> Netty-Client-stg1examplestorm2-node01.example.com/10.202.144.16:6720
>>> closed
>>> 2017-07-17 17:03:00.639 b.s.d.worker [INFO] Shutting down receive thread
>>> 2017-07-17 17:03:00.640 b.s.m.n.Client [INFO] creating Netty Client,
>>> connecting to stg1examplestorm2-node03.example.com:6720, bufferSize:
>>> 5242880
>>> 2017-07-17 17:03:00.640 o.a.s.s.o.a.c.r.ExponentialBackoffRetry [WARN]
>>> maxRetries too large (300). Pinning to 29
>>> 2017-07-17 17:03:00.640 b.s.u.StormBoundedExponentialBackoffRetry [INFO]
>>> The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
>>> 2017-07-17 17:03:00.703 b.s.m.loader [INFO] Shutting down
>>> receiving-thread: [excla-testcla-1-1500324928, 6720]
>>> 2017-07-17 17:03:00.703 b.s.m.loader [INFO] Waiting for
>>> receiving-thread:[excla-testcla-1-1500324928, 6720] to die
>>> 2017-07-17 17:03:00.704 b.s.m.loader [INFO] Shutdown receiving-thread:
>>> [excla-testcla-1-1500324928, 6720]
>>> 2017-07-17 17:03:00.704 b.s.d.worker [INFO] Shut down receive thread
>>> 2017-07-17 17:03:00.704 b.s.d.worker [INFO] Terminating messaging context
>>> 2017-07-17 17:03:00.705 b.s.d.worker [INFO] Shutting down executors
>>> 2017-07-17 17:03:00.705 b.s.d.executor [INFO] Shutting down executor
>>> word:[12 12]
>>> 2017-07-17 17:03:00.706 b.s.util [INFO] Async loop interrupted!
>>> 2017-07-17 17:03:00.713 b.s.util [INFO] Async loop interrupted!
>>>
>>> etc...
>>>
>>> *supervisor logs:*
>>>
>>> 2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Syncing processes
>>> 2017-07-17 17:02:58.142 b.s.d.supervisor [DEBUG] Assigned executors:
>>> {6720 {:storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [16
>>> 16] [13 13] [10 10] [4 4]]}}
>>> 2017-07-17 17:02:58.143 b.s.d.supervisor [DEBUG] Allocated:
>>> {"e96fc1c2-eeab-40f1-8639-ede11233415b" [:valid {:time-secs 1500325377,
>>> :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1] [-1 -1] [16
>>> 16] [13 13] [10 10] [4 4]], :port 6720}]}
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> notification sessionid:0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> WatchedEvent state:SyncConnected type:NodeDeleted
>>> path:/assignments/excla-testcla-1-1500324928 for sessionid 0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> notification sessionid:0x35d4744d5c10010
>>> 2017-07-17 17:03:00.742 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Got
>>> WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/assignments
>>> for sessionid 0x35d4744d5c10010
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.743 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: EventListener - 0 ms
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: connection-state-parent-process - 0 ms
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Reading reply
>>> sessionid:0x35d4744d5c10010, packet:: clientPath:null serverPath:null
>>> finished:false header:: 482,12  replyHeader:: 482,81609600378,0  request::
>>> '/storm_0_10/assignments,T  response:: v{},s{3438131947,34381319647,1491855399679,1491855399679,0,78,0,0,1,0,81609600378}
>>> 2017-07-17 17:03:00.744 o.a.s.s.o.a.c.u.DefaultTracerDriver [TRACE]
>>> Trace: GetChildrenBuilderImpl-Foreground - 1 ms
>>> 2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Synchronizing supervisor
>>> 2017-07-17 17:03:00.745 b.s.d.supervisor [DEBUG] Storm code map: {}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Downloaded storm ids:
>>> #{"excla-testcla-1-1500324928"}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] All assignment:
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] New assignment: {}
>>> 2017-07-17 17:03:00.746 b.s.d.supervisor [DEBUG] Writing new assignment
>>> {}
>>> 2017-07-17 17:03:00.748 b.s.d.supervisor [INFO] Removing code for storm
>>> id excla-testcla-1-1500324928
>>> 2017-07-17 17:03:00.750 b.s.util [DEBUG] Rmr path
>>> /data/storm/supervisor/stormdist/excla-testcla-1-1500324928
>>> 2017-07-17 17:03:00.753 b.s.u.LocalState [DEBUG] New Local State for
>>> /data/storm/workers/e96fc1c2-eeab-40f1-8639-ede11233415b/heartbeats
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Worker
>>> e96fc1c2-eeab-40f1-8639-ede11233415b is :disallowed: {:time-secs
>>> 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1]
>>> [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720} at supervisor time-se
>>> cs 1500325380
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Syncing processes
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Assigned executors: {}
>>> 2017-07-17 17:03:00.754 b.s.d.supervisor [DEBUG] Allocated:
>>> {"e96fc1c2-eeab-40f1-8639-ede11233415b" [:disallowed {:time-secs
>>> 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [[7 7] [1 1]
>>> [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}]}
>>> 2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down and
>>> clearing state for id e96fc1c2-eeab-40f1-8639-ede11233415b. Current
>>> supervisor time: 1500325380. State: :disallowed, Heartbeat: {:time-secs
>>> 1500325379, :storm-id "excla-testcla-1-1500324928", :executors [
>>> [7 7] [1 1] [-1 -1] [16 16] [13 13] [10 10] [4 4]], :port 6720}
>>> 2017-07-17 17:03:00.755 b.s.d.supervisor [INFO] Shutting down
>>> 404f7dd0-363c-4161-a19e-c95e56cbe96e:e96fc1c2-eeab-40f1-8639-ede11233415b
>>> 2017-07-17 17:03:00.755 b.s.config [INFO] GET worker-user
>>> e96fc1c2-eeab-40f1-8639-ede11233415b
>>>
>>
>>
>
