gump-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam R. B. Jack" <>
Subject Re: termination algorithm
Date Mon, 21 Mar 2005 03:28:17 GMT

Thanks for your input. I recently (an hour ago, after a year+ of fighting
this & using finding ineffective solutions) implemented spawning along the
lines of:

>     1) fork, getting forkPID for the child process.
>     2) [child]:     setpgrp [become a group leader], run child
>     3) [parent]:   wait w/ a timer [60 mins] to sigkill the group
        [using, with some checking, the effect of

This ought allow us to clean as we go, with the limitations you stated. That
said, there are some additional good things to learn from your solution, so
I'll capture those sophistications also.

Thanks for the info.


----- Original Message ----- 
From: "Greg Stein" <>
To: <>
Cc: <>
Sent: Sunday, March 20, 2005 7:46 PM
Subject: termination algorithm

> Hi all,
> Was talking with Leo here at the infrastructure gathering, and he
> mentioned that Gump was having issues cleaning up zombie processes. He
> asked me how to make that happen in Linux. The general reply is "use
> os.waitpid()" (in Python; waitpid() is the general POSIX thing).
> Ed Korthof and I explored this a bit more to come up with a general
> algorithm for cleaning up "everything" after a Gump run.
> To start with, Gump should put all fork'd children into their own process
> groups, and then remember those groups' ids. This will enable you to kill
> any grandchild process or other things that get spawned. Even if the
> process gets re-parented to the init process, you can give it the
> smackdown via the process group. Of course, if somebody else monkeys with
> process groups, you'll lose track of them. There are limits to cleanup :-)
> When you want to clean up, you can send every process group SIGTERM. If
> any killpg() call throws an exception with ESRCH (no processes in that
> group), then remove it from the saved list of groups. Next, you would
> start looping to wait for all processes to exit, or to reach a timer on
> that wait. You want to quickly loop on everything that exits, terminate
> the loop when there is nothing more, and then pause a second if stuff is
> still busy shutting down. If you timeout and some are left, then SIGKILL
> them and go reap again. The algorithm would look like:
> def clean_up_processes(pgrp_list):
>   # send SIGTERM to everything, and update pgrp_list to just those
>   # process groups which have processes in them.
>   kill_groups(pgrp_list, signal.SIGTERM)
>   # pass a copy of the process groups. we want to remember every
>   # group that we SIGTERM'd so that we can SIGKILL them later. it
>   # is possible that a process in the pgrp was reparented to the
>   # init process. those will be invisible to wait(), so we don't
>   # want to mistakenly think we've killed all processes in the
>   # group. thus, we preserve the list and SIGKILL it later.
>   reap_children(pgrp_list[:])
>   # SIGKILL everything, editing pgrp_list again.
>   kill_groups(pgrp_list, signal.SIGKILL)
>   # reap everything left, but don't really bother waiting on them.
>   # if we exit, then init will reap them.
>   reap_children(pgrp_list, 60)
> def kill_groups(pgrp_list, sig)
>   # NOTE: this function edits pgrp_list
>   for pgrp in pgrp_list[:]:
>     try:
>       os.killpg(-pgrp, sig)
>     except IOError, e:
>       if e.errno == errno.ESRCH:
>         pgrp_list.remove(pgrp)
> def reap_children(pgrp_list, timeout=300):
>   # NOTE: this function edits pgrp_list
>   # keep reaping until the timeout expires, or we finish
>   end_time = time.time() + timeout
>   # keep reaping until all pgrps are done, or we run out of time
>   while pgrp_list and time.time() < end_time:
>     # pause for a bit while processes work on exiting. this pause is
>     # at the top, so we can also pause right after the killpg()
>     time.sleep(1)
>     # go through all pgrps to reap them
>     for pgrp in pgrp_list[:]:
>       # loop quickly to clean everything in this pgrp
>       while 1:
>         try:
>           pid, status = os.waitpid(-pgrp, os.WNOHANG)
>         except IOError, e:
>           if e.errno == errno.ECHILD:
>             # no more children in this pgrp.
>             pgrp_list.remove(pgrp)
>             break
>           raise
>         if pid == 0:
>   # some stuff has not exited yet, and WNOHANG avoided
>   # blocking. go ahead and move to the next pgrp.
>   break
> That should clean up everything. If stuff *still* hasn't exited, then
> there isn't much you can do. But you will have tried :-)
> Hope that helps! EdK and I haven't built test cases for the above, but it
> has been doubly-reviewed, so we think the algorithm/code should work.
> Cheers,
> -g
> p.s. note that we aren't on general@gump, so CC: if you reply...
> -- 
> Greg Stein,
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message