gump-general mailing list archives

From Greg Stein
Subject termination algorithm
Date Mon, 21 Mar 2005 02:46:52 GMT
Hi all,

Was talking with Leo here at the infrastructure gathering, and he
mentioned that Gump was having issues cleaning up zombie processes. He
asked me how to make that happen in Linux. The general reply is "use
os.waitpid()" (in Python; waitpid() is the general POSIX thing).
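
For reference, a minimal sketch of what reaping one child with os.waitpid()
looks like. The helper name reap_one is made up for illustration; it is not
anything in Gump.

```python
import os

def reap_one(pid):
    """Block until child `pid` exits; return its exit code.

    Hypothetical helper. A negative return value means the child was
    killed by that signal number.
    """
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)

pid = os.fork()
if pid == 0:
    os._exit(7)          # child: exit immediately with status 7
print(reap_one(pid))     # 7
```

Until the parent makes that call, the exited child stays in the process
table as a zombie.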

Ed Korthof and I explored this a bit more to come up with a general
algorithm for cleaning up "everything" after a Gump run.

To start with, Gump should put all fork'd children into their own process
groups, and then remember those groups' ids. This will enable you to kill
any grandchild process or other things that get spawned. Even if the
process gets re-parented to the init process, you can give it the
smackdown via the process group. Of course, if somebody else monkeys with
process groups, you'll lose track of them. There are limits to cleanup :-)
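
One way this could look, as a sketch only: it assumes children are started
through subprocess rather than a bare fork(), and spawn_in_own_group is a
hypothetical name. start_new_session=True makes the child call setsid(),
which gives it a new session as well as a new process group; calling
os.setpgid(0, 0) in a fork'd child would give just the new group.

```python
import os
import subprocess

def spawn_in_own_group(argv, pgrp_list):
    """Start argv in a fresh process group and remember the group id."""
    # the child calls setsid() before exec, so it leads a new process
    # group whose id equals its own pid. anything it spawns inherits
    # that group unless it changes groups itself.
    proc = subprocess.Popen(argv, start_new_session=True)
    pgrp_list.append(os.getpgid(proc.pid))
    return proc
```

The list filled in here is the pgrp_list that the cleanup code would later
walk.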

When you want to clean up, you can send every process group SIGTERM. If
any killpg() call throws an exception with ESRCH (no processes in that
group), then remove it from the saved list of groups. Next, you would
start looping to wait for all processes to exit, or to reach a timer on
that wait. You want to quickly loop on everything that exits, terminate
the loop when there is nothing more, and then pause a second if stuff is
still busy shutting down. If you time out and some are left, then SIGKILL
them and go reap again. The algorithm would look like:

import errno
import os
import signal
import time

def clean_up_processes(pgrp_list):
  # send SIGTERM to everything, and update pgrp_list to just those
  # process groups which have processes in them.
  kill_groups(pgrp_list, signal.SIGTERM)

  # pass a copy of the process groups. we want to remember every
  # group that we SIGTERM'd so that we can SIGKILL them later. it
  # is possible that a process in the pgrp was reparented to the
  # init process. those will be invisible to wait(), so we don't
  # want to mistakenly think we've killed all processes in the
  # group. thus, we preserve the list and SIGKILL it later.
  reap_children(pgrp_list[:])

  # SIGKILL everything, editing pgrp_list again.
  kill_groups(pgrp_list, signal.SIGKILL)

  # reap everything left, but don't really bother waiting on them.
  # if we exit, then init will reap them.
  reap_children(pgrp_list, 60)

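def kill_groups(pgrp_list, sig):
  # NOTE: this function edits pgrp_list

  for pgrp in pgrp_list[:]:
    try:
      # killpg() takes the (positive) process group id
      os.killpg(pgrp, sig)
    except OSError as e:
      if e.errno == errno.ESRCH:
        # no processes left in this group, so forget about it
        pgrp_list.remove(pgrp)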

def reap_children(pgrp_list, timeout=300):
  # NOTE: this function edits pgrp_list

  # keep reaping until the timeout expires, or we finish
  end_time = time.time() + timeout

  # keep reaping until all pgrps are done, or we run out of time
  while pgrp_list and time.time() < end_time:
    # pause for a bit while processes work on exiting. this pause is
    # at the top, so we can also pause right after the killpg()
    time.sleep(1)

    # go through all pgrps to reap them
    for pgrp in pgrp_list[:]:
      # loop quickly to clean everything in this pgrp
      while 1:
        try:
          pid, status = os.waitpid(-pgrp, os.WNOHANG)
        except OSError as e:
          if e.errno == errno.ECHILD:
            # no more children in this pgrp.
            pgrp_list.remove(pgrp)
            break
          raise
        if pid == 0:
          # some stuff has not exited yet, and WNOHANG avoided
          # blocking. go ahead and move to the next pgrp.
          break

That should clean up everything. If stuff *still* hasn't exited, then
there isn't much you can do. But you will have tried :-)

Hope that helps! EdK and I haven't built test cases for the above, but it
has been doubly-reviewed, so we think the algorithm/code should work.
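
If somebody wants a starting point for a test, the reparenting case is
probably the one to check first. A rough sketch, assuming sh and sleep are
on the PATH. The function name and the pipe trick are illustrative only:
the pipe reaches EOF once every process holding its write end has died, so
the check does not depend on anyone reaping the orphan.

```python
import os
import select
import signal
import subprocess

def orphan_dies_with_group(timeout=5.0):
    """Return True if SIGTERM to the group kills a reparented grandchild."""
    r, w = os.pipe()
    # sh backgrounds `sleep` and exits at once, so `sleep` is orphaned
    # but stays in sh's process group (pgid == sh's pid).
    proc = subprocess.Popen(["sh", "-c", "sleep 60 &"],
                            start_new_session=True, pass_fds=(w,))
    os.close(w)      # drop our copy; only sh's descendants hold it now
    proc.wait()      # reap sh; waitpid() cannot see the grandchild
    os.killpg(proc.pid, signal.SIGTERM)
    # EOF on the read end means every holder of the write end is gone
    ready, _, _ = select.select([r], [], [], timeout)
    dead = bool(ready) and os.read(r, 1) == b""
    os.close(r)
    return dead
```

A True result means the signal reached a process that wait() could no
longer see, which is the whole reason for tracking groups instead of pids.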


p.s. note that we aren't on general@gump, so CC: if you reply...

Greg Stein,
