cloudstack-users mailing list archives

From Nick Thompson <Nick.Thomp...@neos.co.nz>
Subject RE: Virtual routers randomly rebooting
Date Mon, 19 Aug 2019 00:00:20 GMT
So it's been a week and we haven't had any virtual routers randomly reboot. Gregor hit the
nail on the head pointing to memory. Since the latest HVM versions of the virtual router
(4.11.X) I have noticed more swap activity (the kswapd process using more CPU than usual) and
can see swapping in performance monitoring; swap used is about 40-50MB. Affected system
VMs are the ones that are set up as VPC or actively running HAProxy.
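
A minimal sketch of how a swap figure like the one above could be read on a router, assuming a Linux guest that exposes /proc/meminfo. The function names and the sample values are illustrative only and are not taken from a real system VM.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_mb(info):
    """Swap in use, in MB, derived from SwapTotal and SwapFree (both in kB)."""
    return (info["SwapTotal"] - info["SwapFree"]) / 1024.0


# Made-up sample in /proc/meminfo format, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:         250000 kB
SwapTotal:        262144 kB
SwapFree:         215040 kB
"""

print("swap used: %.1f MB" % swap_used_mb(parse_meminfo(SAMPLE_MEMINFO)))
```

On a router, the sample string would be replaced with the contents of /proc/meminfo, for example open("/proc/meminfo").read().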

I have allocated an extra 128MB of RAM to the affected routers and the issue seems to have
gone away (for the time being anyway).

Modified the service offering in SQL:
# Get the ID
select * from service_offering_view where name LIKE "System Offering For Software Router";
# Update the service offering (256MB + 128MB = 384MB)
update service_offering_view set ram_size = 384 where id = 7;
# Rebooted a test router and confirmed the new memory is available.


Is the currently allocated 256MB of memory no longer enough for system VMs?
If not, how much should be allocated? An extra 128MB does seem a bit excessive, maybe 64MB
is enough?
What could be causing this extra usage?

Nothing really stands out as excessive memory usage when comparing the system
VMs.

#Top memory usage in a VPC virtual router (V 4.11.3)
ps -o pid,user,%mem,command ax | sort -k3,3nr
PID   USER  %MEM  COMMAND
1369  root  3.2   python /opt/cloud/bin/passwd_server_ip.py 10.30.0.1
3607  root  3.0   /usr/lib/ipsec/charon
354   root  2.1   /usr/sbin/xe-daemon
1416  root  1.8   /usr/sbin/apache2 -k start
1     root  1.8   /sbin/init
748   root  1.8   /lib/systemd/systemd-journald

#Top memory usage in a VPC virtual router (V 4.6)
PID   USER      %MEM  COMMAND
4474  root      5.9   /usr/lib/ipsec/pluto
3180  root      3.3   python /opt/cloud/bin/passwd_server_ip.py 10.30.0.1
2370  root      2.4   /usr/sbin/rsyslogd -c5
3116  root      1.9   /usr/sbin/apache2 -k start
3122  www-data  1.3   /usr/sbin/apache2 -k start
3121  www-data  1.3   /usr/sbin/apache2 -k start
4526  root      1.3   pluto helper  #  0
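
To make listings like the two above easier to compare, the %MEM column can be converted to megabytes. This is a rough sketch only: it assumes the guest sees the full 256MB mentioned in this message (the kernel normally reports slightly less), and the helper names are hypothetical.

```python
def parse_ps_rows(text):
    """Parse 'PID USER %MEM COMMAND' rows; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        pid, user, pct, command = parts
        rows.append((int(pid), user, float(pct), command))
    return rows


def percent_to_mb(pct, total_mb):
    """Convert a %MEM value into MB for a guest with total_mb of RAM."""
    return pct * total_mb / 100.0


TOTAL_MB = 256  # assumption: the default allocation discussed in this thread

# Two rows copied from the 4.11.3 listing in this message.
SAMPLE_PS = """\
PID USER %MEM COMMAND
1369 root 3.2 python /opt/cloud/bin/passwd_server_ip.py 10.30.0.1
354 root 2.1 /usr/sbin/xe-daemon
"""

for pid, user, pct, command in parse_ps_rows(SAMPLE_PS):
    print("%6d %6.1f MB  %s" % (pid, percent_to_mb(pct, TOTAL_MB), command))
```

With these assumptions a process at 3.2% works out to roughly 8MB, so the per-process figures in both listings are small next to the total allocation.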

Regards,
Nick Thompson

-----Original Message-----
From: Nick Thompson [mailto:Nick.Thompson@neos.co.nz] 
Sent: Wednesday, 7 August 2019 12:34 p.m.
To: 'users@cloudstack.apache.org' <users@cloudstack.apache.org>
Subject: RE: Virtual routers randomly rebooting

Thanks Gregor,

I'll give that a go. I did notice a high load average on some VRs a while back, could be related.

Regards,

Nick Thompson


-----Original Message-----
From: Riepl, Gregor (SWISS TXT) [mailto:Gregor.Riepl@swisstxt.ch]
Sent: Wednesday, 7 August 2019 3:44 a.m.
To: users@cloudstack.apache.org
Subject: Re: Virtual routers randomly rebooting

Hi Nick,

This might not be relevant for Xen, but we've had problems with memory leaks on the VRs on
VMware when balloon memory was enabled.

A while ago, we built a custom router monitoring setup via SSH for our environment, because
CloudStack doesn't give us enough information about router status. This caused the VR kernel
to leak memory, and the router to reboot suddenly when memory was used up.

The issue was fixed by several memory management optimisations on the system VM template (done
by René, Rohit and Angus, if I remember correctly) and by setting an OS type that would cause
VMware to completely disable balloon memory.

It's possible that you have a similar issue - can you monitor the affected VRs for a while
and see if the reboots are caused by a memory leak?
We still see memory being used up slowly, but when a critical threshold (~98%) is reached,
the kernel will garbage collect it.
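
One way to approach the monitoring suggested above is to sample used memory periodically and fit a straight line to it. The sketch below is an illustration under stated assumptions, not the monitoring setup described in this message: the sample values are invented, the function names are hypothetical, and the 98% figure is simply the threshold quoted above.

```python
def slope_per_hour(samples):
    """Least-squares slope of used-memory percent against time in hours.

    samples is a list of (hours_since_start, used_percent) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / float(n)
    mean_u = sum(u for _, u in samples) / float(n)
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until(threshold, samples):
    """Hours until the fitted trend reaches threshold, or None if not growing."""
    rate = slope_per_hour(samples)
    if rate <= 0:
        return None
    latest_used = samples[-1][1]
    return max(0.0, (threshold - latest_used) / rate)


# Invented samples: used memory grows by 2 percentage points per hour.
SAMPLES = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]

print("growth: %.1f %%/hour" % slope_per_hour(SAMPLES))
print("hours until 98%%: %.1f" % hours_until(98.0, SAMPLES))
```

A steady positive slope over many hours would be consistent with a leak, while a sawtooth pattern would suggest memory is being reclaimed as described above.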

Regards,
Gregor
________________________________
From: Nick Thompson <Nick.Thompson@neos.co.nz>
Sent: 06 August 2019 06:01
To: 'users@cloudstack.apache.org' <users@cloudstack.apache.org>
Subject: RE: Virtual routers randomly rebooting

Hey,

Thanks Andrija,

From what I have read I didn't need to add the hypervisor mapping for 7.5, as CloudStack only
looks at the version number rather than the name of the hypervisor at this stage (e.g. XenServer
vs XCP-ng). Also, the issue was happening in XenServer 6.5 anyway and CloudStack isn't having
any problems controlling XCP-ng.

From what I have seen so far, only some VPCs are randomly rebooting (on different hosts
too). Storage, Console and standard network VMs seem to be fine (however, I have a lot more
VPCs than any other network type). I'm not sure if the VM is rebooting itself or if the Management
Server is having an issue communicating with the VM and so is shutting it down and restarting it.

Is there a way to disable checks on VPCs so it doesn't try to restart the router VM? I have
found and tried network.router.EnableServiceMonitoring = NO, but the documentation notes
that VPC networks are not supported.

Any suggestions on what I could try or look into would be greatly appreciated.

Regards,
Nick Thompson


-----Original Message-----
From: Andrija Panic [mailto:andrija.panic@gmail.com]
Sent: Wednesday, 17 July 2019 6:24 a.m.
To: users <users@cloudstack.apache.org>; Rohit Yadav <rohit.yadav@shapeblue.com>
Subject: Re: Virtual routers randomly rebooting

Have you added os/hypervisor mappings inside the DB? I vaguely remember 7.5 not having the needed
mapping and being treated as 6.5, so a manual fix was needed.

Perhaps Rohit can shed some light?

Anyways, a full log would be great (pastebin or other online service please).

Regards

On Tue, 16 Jul 2019, 00:59 Nick Thompson, <Nick.Thompson@neos.co.nz> wrote:

> Hey,
>
> Since we upgraded to the 4.11 branch (currently 4.11.3) and virtual 
> routers have become HVM on XenServer/XCP-ng we have had problems with 
> the virtual routers randomly rebooting themselves. We still have some 
> running in the older paravirtualized mode and they seem to be fine (it 
> may be that the Management server can't communicate to these virtual 
> routers since they are an older template?). Other running Windows/Linux VMs are fine.
>
> CloudStack Cluster: XCP-ng 7.5 (previously XenServer 6.5, same issue was
> happening)
> CloudStack: 4.11.3 (same issue in 4.11.2, was working fine in V4.9.3)
>
> When digging through the management-server.log, I have found the 
> following;
>
> >grep -n "Error while collecting network stats from router"
> /var/log/cloudstack/management/management-server.log
> 919060:2019-07-16 10:28:43,756 WARN
> [c.c.n.r.VirtualNetworkApplianceManagerImpl]
> (RouterMonitor-1:ctx-1931d792)
> (logid:6e236728) Error while collecting network stats from router:
> r-2280-VM from host: 71; details: Exception: java.lang.Exception
> 919328:2019-07-16 10:28:50,940 WARN
> [c.c.n.r.VirtualNetworkApplianceManagerImpl]
> (RouterMonitor-1:ctx-1931d792)
> (logid:6e236728) Error while collecting network stats from router:
> r-2280-VM from host: 71; details: Exception: java.lang.Exception
> 920533:2019-07-16 10:29:54,768 WARN
> [c.c.n.r.VirtualNetworkApplianceManagerImpl]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Error while collecting network stats from router:
> r-2280-VM from host: 71; details: Exception: java.lang.Exception
> 920621:2019-07-16 10:30:01,952 WARN
> [c.c.n.r.VirtualNetworkApplianceManagerImpl]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Error while collecting network stats from router:
> r-2280-VM from host: 71; details: Exception: java.lang.Exception
>
> >less +920621 /var/log/cloudstack/management/management-server.log
> 2019-07-16 10:30:01,952 DEBUG [c.c.a.t.Request]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Seq 71-7505811728966836555: Received:  { Ans: , MgmtId:
> 226842157555374, via: 71(hostname), Ver: v1, Flags: 10, { 
> NetworkUsageAnswer } }
> 2019-07-16 10:30:01,952 DEBUG [c.c.a.m.AgentManagerImpl]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Details from executing class
> com.cloud.agent.api.NetworkUsageCommand: Exception:
> java.lang.Exception
> Message:  vpc network usage plugin call failed
> Stack: java.lang.Exception:  vpc network usage plugin call failed
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.executeNetworkUsage(XenServer56NetworkUsageCommandWrapper.java:84)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.execute(XenServer56NetworkUsageCommandWrapper.java:41)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.execute(XenServer56NetworkUsageCommandWrapper.java:33)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xenbase.CitrixRequestWrapper.execute(CitrixRequestWrapper.java:122)
>         at
> com.cloud.hypervisor.xenserver.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:1737)
>         at
> com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:315)
>         at
> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at
> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> 2019-07-16 10:30:01,952 WARN
> [c.c.n.r.VirtualNetworkApplianceManagerImpl]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Error while collecting network stats from router:
> r-2280-VM from host: 71; details: Exception: java.lang.Exception
> Message:  vpc network usage plugin call failed
> Stack: java.lang.Exception:  vpc network usage plugin call failed
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.executeNetworkUsage(XenServer56NetworkUsageCommandWrapper.java:84)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.execute(XenServer56NetworkUsageCommandWrapper.java:41)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xen56.XenServer56NetworkUsageCommandWrapper.execute(XenServer56NetworkUsageCommandWrapper.java:33)
>         at
> com.cloud.hypervisor.xenserver.resource.wrapper.xenbase.CitrixRequestWrapper.execute(CitrixRequestWrapper.java:122)
>         at
> com.cloud.hypervisor.xenserver.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:1737)
>         at
> com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:315)
>         at
> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at
> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> 2019-07-16 10:30:01,973 DEBUG [c.c.a.t.Request]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Seq 71-7505811728966836557: Sending  { Cmd , MgmtId:
> 226842157555374, via: 71(hostname), Ver: v1, Flags: 100011, 
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"checkBeforeClean
> up":false,"controlIp":"169.254.0.200","forceStop":true,"volumesToDisco
> nnect":[],"vmName":"r-2280-VM","executeInSequence":false,"wait":0}}]
> }
> 2019-07-16 10:30:01,973 DEBUG [c.c.a.t.Request]
> (Work-Job-Executor-63:ctx-6d1f0235 job-24930/job-24944 ctx-a15a2697)
> (logid:2b94fd5d) Seq 71-7505811728966836557: Executing:  { Cmd , MgmtId:
> 226842157555374, via: 71(hostname), Ver: v1, Flags: 100011, 
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"checkBeforeClean
> up":false,"controlIp":"169.254.0.200","forceStop":true,"volumesToDisco
> nnect":[],"vmName":"r-2280-VM","executeInSequence":false,"wait":0}}]
> }
> 2019-07-16 10:30:01,973 DEBUG [c.c.a.m.DirectAgentAttache]
> (DirectAgent-430:ctx-77c57645) (logid:8733e444) Seq 71-7505811728966836557:
> Executing request
> 2019-07-16 10:30:01,995 DEBUG [c.c.h.x.r.w.x.CitrixStopCommandWrapper]
> (DirectAgent-430:ctx-77c57645) (logid:2b94fd5d) 9. The VM r-2280-VM is 
> in Stopping state
> 2019-07-16 10:30:02,303 DEBUG [c.c.a.m.DirectAgentAttache]
> (DirectAgent-390:ctx-3b9a78cd) (logid:27f6ec94) Seq 67-8459448950062117779:
> Response Received
>
>
> Any thoughts would be greatly appreciated.
>
> Cheers,
> Nick.
>
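
The manual grep and less steps in the quoted message can be approximated with a short script that lists routers which logged a network stats error and were later sent a StopCommand. This is a sketch under assumptions: the regular expressions are based only on the log lines quoted in this thread, the sample lines are shortened versions of those lines with the thread and context fields dropped, and the function name is hypothetical. Finding both events for one router does not by itself show that one caused the other.

```python
import re

# Patterns derived from the log excerpts quoted in this thread.
STATS_ERROR = re.compile(r"Error while collecting network stats from router: (\S+)")
STOP_VM = re.compile(r'StopCommand.*?"vmName":"([^"]+)"')


def routers_stopped_after_stats_error(lines):
    """Return router names that logged a stats error and later a StopCommand."""
    failed = set()
    stopped = []
    for line in lines:
        match = STATS_ERROR.search(line)
        if match:
            failed.add(match.group(1))
            continue
        match = STOP_VM.search(line)
        if match and match.group(1) in failed and match.group(1) not in stopped:
            stopped.append(match.group(1))
    return stopped


# Shortened, unwrapped versions of two log lines from the quoted message.
SAMPLE_LOG = [
    "2019-07-16 10:30:01,952 WARN [c.c.n.r.VirtualNetworkApplianceManagerImpl] "
    "Error while collecting network stats from router: r-2280-VM from host: 71; "
    "details: Exception: java.lang.Exception",
    "2019-07-16 10:30:01,973 DEBUG [c.c.a.t.Request] Sending { Cmd , "
    '[{"com.cloud.agent.api.StopCommand":{"forceStop":true,"vmName":"r-2280-VM"}}] }',
]

print(routers_stopped_after_stats_error(SAMPLE_LOG))
```

To run it against a real log, pass an open file object for management-server.log instead of SAMPLE_LOG; iterating over the file yields one line at a time.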
