cloudstack-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jakub Kublik <jakub_kub...@polcom.com.pl>
Subject Re: [Questionable] Re: Timeout with live migration
Date Tue, 13 Oct 2015 08:45:55 GMT
Hi Ryan,
we hit the same problem few days ago. After we changed parameters: 
migratewait, storage.pool.max.waitseconds, vm.op.cancel.interval and 
vm.op.cleanup.wait to 36000 (10h) we can migrate large volumes (500GB 
and more).
We use xenserver 6.5 and ACS 4.5.1
Regards,
Kuba

W dniu 2015-10-13 o 05:15, Ryan Farrington pisze:
> Yes i can open JIRA tickets. What would you like for me to do?
>
> I'll be happy to change the "wait" parameter.  Do I assume it should be 1/2 of the value
i want it to be?
>
>
>
> ________________________________________
> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
> Sent: Monday, October 12, 2015 10:12 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> There is your problem, there are currently two distinct values conrolling
> those async jobs.
> Change that value and everything will work for u.
> Can you open a jira ticket?
>
> On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <rfarrington@remitdata.com
>> wrote:
>> wait is currently configured to be 3600
>>
>>
>>
>> ________________________________________
>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>> Sent: Monday, October 12, 2015 9:46 PM
>> To: users@cloudstack.apache.org
>> Subject: [Questionable]  Re: Timeout with live migration
>>
>> I found something odd,
>> can you check the parameter called "wait", what value is it using ?
>>
>> On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
>> rfarrington@remitdata.com
>>> wrote:
>>> Yes the parameter was set long ago and the management server has been
>>> restarted numerous time over the past few days as we played with other
>>> parameters to no effect.
>>>
>>> After looking at the log a little more does the "Failed to send command,
>>> due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
>>> 996939857 to Host 38 timed out after 7200" mean that the migration start
>>> command is being sent in some kind of synchronous mode and not returning
>>> control back to the job manager?
>>>
>>>
>>>
>>>
>>> ________________________________________
>>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>>> Sent: Monday, October 12, 2015 8:46 PM
>>> To: users@cloudstack.apache.org
>>> Subject: [Questionable]  Re: Timeout with live migration
>>>
>>> I thought you using the command  “migrateVirtualMachineWithVolume” but it
>>> seems that you are using “migrateVolume” command from ACS's API.
>>>
>>>
>>> For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
>>> 3600, means 1 hour of timeout.
>>>
>>> For the “migrateVolume” is the same, they both end up in
>>>
>>>
>> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
>>> and in that method the parameter is the same.
>>>
>>>
>>> If your parameter is set to 36000 (10 hours) I do not see why you are
>>> getting the exception after 2 hours.
>>>
>>> Did you restart the management servers after you changed the parameter?
>>>
>>> On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
>>> rfarrington@remitdata.com
>>>> wrote:
>>>> Here is the full log, including the stack for the exception, that we
>> get
>>>> at the 2 hour mark. as for the migratewait it is set to 36000 which
>>> should
>>>> be 10 hours.
>>>>
>>>> 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
>>>> (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
>>>> 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
>>> some
>>>> more time because this is the current command
>>>> 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
>>>> com.cloud.exception.OperationTimedoutException in error code list for
>>>> exceptions
>>>> 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
>>> on
>>>> Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via:
>> 38(xen-nc-bc2b7),
>>>> Ver: v1, Flags: 100111,
>>>>
>> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
>>>> }
>>>> 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857:
>> Cancelling.
>>>> 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
>>>> commands found
>>>> 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
>>> to
>>>> Agent:38, com.cloud.exception.OperationTimedoutException: Commands
>>>> 996939857 to Host 38 timed out after 7200
>>>> 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
>>>> com.cloud.utils.exception.CloudRuntimeException: Failed to send
>> command,
>>>> due to Agent:38, com.cloud.exception.OperationTimedoutException:
>> Commands
>>>> 996939857 to Host 38 timed out after 7200
>>>>          at
>>>>
>> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
>>>>          at
>>>>
>> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
>>>>          at
>>>>
>> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
>>>>          at
>>>>
>> org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
>>>>          at
>>>>
>> org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
>>>>          at
>>>>
>> com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
>>>>          at
>>>>
>> com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
>>>>          at
>>>>
>> com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
>>>>          at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown
>> Source)
>>>>          at
>>>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>          at java.lang.reflect.Method.invoke(Method.java:622)
>>>>          at
>>>>
>> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
>>>>          at
>>>>
>> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
>>>>          at
>>>>
>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
>>>>          at
>>>>
>> org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
>>>>          at
>>>>
>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
>>>>          at
>>>>
>> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
>>>>          at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
>>>>          at
>>>>
>> org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)
>>>>          at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
>>>>          at
>>>>
>> com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)
>>>>          at
>>>>
>> com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>          at
>>>>
>> com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)
>>>>          at
>>>>
>> org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>          at
>>>>
>> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>          at
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>          at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>>          at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>>          at
>>>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>>          at
>>>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>          at java.lang.Thread.run(Thread.java:701)
>>>> 2015-10-12 18:41:20,479 WARN  [o.a.c.s.d.ObjectInDataStoreManagerImpl]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object
>>>> (VOLUME,
>>>> org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f
>> ),
>>>> no need to delete from object in store ref table
>>>> 2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl]
>>>> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume
>>>> failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send
>>>> command, due to Agent:38,
>> com.cloud.exception.OperationTimedoutException:
>>>> Commands 996939857 to Host 38 timed out after 7200
>>>> 2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>>> (Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus:
>>> FAILED,
>>>> resultCode: 530, result:
>>>>
>> org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed
>>>> to migrate volume"}
>>>> 2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
>>>> (Job-Executor-63:ctx-f7b6817d) Done executing
>>>> org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for
>>> job-5257
>>>> 2015-10-12 18:41:20,489 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
>>>> (Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________
>>>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>>>> Sent: Monday, October 12, 2015 8:24 PM
>>>> To: users@cloudstack.apache.org
>>>> Subject: [Questionable]  Re: [Questionable] Re: Timeout with live
>>> migration
>>>> Now I understand what you are doing, I am familiar with that concept
>>> (live
>>>> migration of VM within a cluster, having the VHD being moved from one
>> SR
>>> to
>>>> another).
>>>>
>>>> I just got confused when I read live migration of volumes (a volume
>> does
>>>> not run by itself, so that why I asked a little for some more
>>> information).
>>>> Looking at the source code this is the variable used to control the
>>>> timeout:
>>>> "long timeout = (_migratewait) * 1000L;"
>>>>
>>>> The value of "_migratewait" is taken from this parameter:
>>>> value = (String) params.get("migratewait");
>>>> _migratewait = NumbersUtil.parseInt(value, 3600);
>>>>
>>>> Therefore, the name of the parameter to be configured is "migratewait",
>>> the
>>>> default value is 3600.
>>>>
>>>>
>>>> BTW1: I think that is a terrible parameter name. We should refactor
>> that,
>>>> could you open a Jira ticket for that?
>>>>
>>>> BTW2: that error message you posted does not seem to be related to the
>>>> migration timeout; hence, in the code if the copy times out the message
>>>> would be:
>>>> "Async " + timeout/1000 + " seconds timeout for task " +
>> task.toString()"
>>>> Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be
>>>> translated into that message, or that might not be related to the
>> problem
>>>> itself, and you just thought that it was.
>>>>
>>>>
>>>> Does it help you?
>>>>
>>>>
>>>> On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <
>>>> rfarrington@remitdata.com
>>>>> wrote:
>>>>> Hypervisor:  XenServer
>>>>>
>>>>> We are moving a data volume from one storage onto another without
>>>> shutting
>>>>> down the VM cause that would just be silly and a triplication of
>> effort
>>>>> with the whole copying to secondary storage and then back off again.
>>> The
>>>>> volume is staying in the same cluster just moving to a different
>>> Primary
>>>>> storage (or SR in the XenServer vernacular)
>>>>>
>>>>> If you are familiar with ESX this is a "Storage VMotion" where as in
>>>>> XenServer it is called "Storage XenMotion".
>>>>>
>>>>> ________________________________________
>>>>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>>>>> Sent: Monday, October 12, 2015 7:53 PM
>>>>> To: users@cloudstack.apache.org
>>>>> Subject: [Questionable]  Re: Timeout with live migration
>>>>>
>>>>> what do you mean with livre migrating data volume ?!
>>>>> I understand a live migration of a VM, but volumes...
>>>>>
>>>>> do you mean live migrating a VM that has a volume attached?
>>>>> are you migrating that volume to a different cluster? or just a
>>> different
>>>>> storage in the same cluster?
>>>>> What hypervisor are you using ?
>>>>>
>>>>>
>>>>> On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
>>>>> rfarrington@remitdata.com>
>>>>> wrote:
>>>>>
>>>>>> Live migrating a data volume. We are purely on shared storage so
no
>>>> local
>>>>>> storage is involved.
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>>>>>> Sent: Monday, October 12, 2015 7:37 PM
>>>>>> To: users@cloudstack.apache.org
>>>>>> Subject: [Questionable]  Re: Timeout with live migration
>>>>>>
>>>>>> Are you live migrating a VM, or migrating a volume of a stopped VM
>>> to a
>>>>>> different primary storage?
>>>>>>
>>>>>> If it is a running VM, is the VM allocated in a shared storage or
>>> local
>>>>>> storage?
>>>>>>
>>>>>> On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
>>>>>> rfarrington@remitdata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The slow transfer is related to the storage we are trying to
>>> migrate
>>>>> off
>>>>>>> of.  We are capable of getting about 350mbps off the disks but
>> when
>>>> we
>>>>>> are
>>>>>>> moving volumes that are greater than about 500GB we end up racing
>>> the
>>>>>> clock
>>>>>>> and hoping that the migration finishes before the job times out.
>>>   It
>>>>>> would
>>>>>>> be awesome to be able to manage that timeout and I know there
>> are a
>>>> ton
>>>>>> of
>>>>>>> settings I just don't know about and am hoping someone might
be
>>> able
>>>> to
>>>>>>> point me in the right direction.
>>>>>>>
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Rafael Weingärtner [rafaelweingartner@gmail.com]
>>>>>>> Sent: Monday, October 12, 2015 6:40 PM
>>>>>>> To: users@cloudstack.apache.org
>>>>>>> Subject: [Questionable]  Re: Timeout with live migration
>>>>>>>
>>>>>>> I would first check your NICs' speed and load, the amount of
RAM
>>>>>> allocated
>>>>>>> for the migrating VM and than check the hypervisor log files.
>>>>>>>
>>>>>>> On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
>>>>>>> jan.arve.nygard@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What version are you running? Check if the copy.volume.wait
>>> setting
>>>>> is
>>>>>>> set
>>>>>>>> to 7200 and increase it. If not you could also check
>>>>>>>> job.cancel.threshold.minutes and job.expire.minutes.
>>>>>>>>
>>>>>>>> -Jan-Arve
>>>>>>>>
>>>>>>>> 2015-10-13 0:46 GMT+02:00 Ryan Farrington <
>>>> rfarrington@remitdata.com
>>>>>> :
>>>>>>>>> We are experiencing a failure in cloudstack waiting for
an
>>> async
>>>>> job
>>>>>>>>> performing a live migration of a volume to finish. I've
>> copied
>>>> the
>>>>>>>> relevant
>>>>>>>>> log entries below.We acknowledge that the migration will
>> take a
>>>> few
>>>>>>> hours
>>>>>>>>> based on the volume of the data and we are looking for
a way
>> to
>>>>>>> increase
>>>>>>>>> the timeout of 7200 seconds into something we know we
can
>> work
>>>>> with.
>>>>>>>>>
>>>>>>>>> 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
>>>>>>>>> (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to
send
>>>> command,
>>>>>> due
>>>>>>>> to
>>>>>>>>> Agent:27, com.cloud.exception.OperationTimedoutException:
>>>> Commands
>>>>>>>>> 835325398 to Host 27 timed out after 7200
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Rafael Weingärtner
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Rafael Weingärtner
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Rafael Weingärtner
>>>>>
>>>>
>>>>
>>>> --
>>>> Rafael Weingärtner
>>>>
>>>
>>>
>>> --
>>> Rafael Weingärtner
>>>
>>
>>
>> --
>> Rafael Weingärtner
>>
>
>
> --
> Rafael Weingärtner


Mime
View raw message