ambari-dev mailing list archives

From "Daniel Horak (JIRA)" <>
Subject [jira] [Created] (AMBARI-9893) Ambari services should be properly daemonized
Date Tue, 03 Mar 2015 13:12:04 GMT
Daniel Horak created AMBARI-9893:

             Summary: Ambari services should be properly daemonized
                 Key: AMBARI-9893
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent, ambari-server
    Affects Versions: 1.6.1
         Environment: HDP 2.1 on RHEL 6
            Reporter: Daniel Horak
            Priority: Critical

Ambari services (_ambari-server_ and _ambari-agent_) are not properly daemonized.

When any service starts as a daemon, it should _become a process group leader_ ([apart from other
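
As a rough illustration of what becoming a process group leader involves, the following Python sketch forks a child that calls {{os.setsid()}} and reports its PID and PGID back through a pipe. It is not Ambari's code: the function name {{spawn_detached_child}} is invented for this example, and a real daemon would typically also fork a second time, change its working directory, and redirect its standard streams.

```python
import os


def spawn_detached_child():
    """Fork a child that starts its own session; return (child_pid, child_pgid)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: setsid() creates a new session and a new process group,
        # making this process the leader of both.
        try:
            os.close(read_fd)
            os.setsid()
            report = "%d %d" % (os.getpid(), os.getpgrp())
            os.write(write_fd, report.encode("ascii"))
        finally:
            # Never let the child fall through into the parent's code path.
            os._exit(0)
    # Parent: read the child's report, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_pid, child_pgid = map(int, reader.read().split())
    os.waitpid(pid, 0)
    return child_pid, child_pgid


if __name__ == "__main__":
    child_pid, child_pgid = spawn_detached_child()
    print("parent pgid:", os.getpgrp())
    print("child pid:", child_pid, "child pgid:", child_pgid)
```

After {{os.setsid()}} the child's PGID equals its own PID, so signalling that group can no longer reach the process that launched it.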

h3. How to reproduce

1) Prepare simple test shell script:

# cat 
  #!/bin/bash -x
  ambari-server restart
  sleep 10
  ambari-server restart
  sleep 10
# chmod +x

This script should restart ambari-server twice (with some delay) and then
print the date.

2) Run the test script.

The script doesn't behave as expected: the second _ambari-server restart_ kills
the whole script! See:

# ./ 
  + ambari-server restart
  Using python  /usr/bin/python2.6
  Restarting ambari-server
  Using python  /usr/bin/python2.6
  Stopping ambari-server
  Ambari Server stopped
  Using python  /usr/bin/python2.6
  Starting ambari-server
  Ambari Server running with 'root' privileges.
  Organizing resource files at /var/lib/ambari-server/resources...
  Waiting for server start...
  Server PID at: /var/run/ambari-server/
  Server out at: /var/log/ambari-server/ambari-server.out
  Server log at: /var/log/ambari-server/ambari-server.log
  Ambari Server 'start' completed successfully.
  + sleep 10
  + ambari-server restart
  Using python  /usr/bin/python2.6
  Restarting ambari-server
  Using python  /usr/bin/python2.6
  Stopping ambari-server
# echo $?

h3. Explanation

After the first {{ambari-server restart}}, the _process group ID_ (_PGID_) of
ambari-server is the same as the _PGID_ of the test shell script. In other words,
ambari-server belongs to the same process group as the test script
because ambari-server has not become a _process group leader_.

The second {{ambari-server restart}} then calls the {{stop()}} function from
{{/usr/sbin/}}, and this function kills all processes in the same
process group as ambari-server (code {{os.killpg(os.getpgid(pid), signal.SIGKILL)}}, where
{{pid}} is the PID of the running ambari-server process).
There would be nothing wrong with this if the ambari service daemon process
created a new process group for itself, but it does not, and that is the root
cause of the bug.
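
One defensive option, separate from fixing the daemonization itself, would be for a stop routine to check whether the target shares the caller's process group before signalling the whole group. The sketch below is hypothetical and is not the actual {{stop()}} implementation; {{choose_kill_target}} and {{stop_process}} are names invented here, and the code assumes Python 3.

```python
import os
import signal
import subprocess
import sys


def choose_kill_target(pid):
    """Return ("group", pgid) if the group can be signalled, else ("process", pid)."""
    pgid = os.getpgid(pid)
    if pgid == os.getpgrp():
        # The target lives in the caller's own process group: killpg() would
        # also kill the caller and whatever script invoked it.
        return ("process", pid)
    return ("group", pgid)


def stop_process(pid, sig=signal.SIGKILL):
    """Signal the target's whole group only when it is not the caller's group."""
    kind, target = choose_kill_target(pid)
    if kind == "group":
        os.killpg(target, sig)
    else:
        os.kill(target, sig)


if __name__ == "__main__":
    # Demo: a child started in its own session can be signalled as a group.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,
    )
    print(choose_kill_target(child.pid))
    stop_process(child.pid)
    child.wait()
```

This guard only protects the caller's own group. If the daemon inherited the group of some other shell or job, unrelated processes in that group would still be hit, so it does not replace having the daemon create its own process group.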

h3. Deeper debugging

You can check the PGIDs via the ps command: {{ps -e --forest -o pgrp,args}}.

You can also add the following lines to the {{}} script after
the first {{ambari-server restart}} command:

echo "shell pid: $$"
ps -o pid,ppid,pgrp -p $(cat /var/run/ambari-server/

When you run the {{}} script again, you will see that the
ambari-server process belongs to the process group of the
shell (the PGRP, i.e. the PGID, of the shell is the same as its PID in this case):

+ echo 'shell pid: 9368'
shell pid: 9368
++ cat /var/run/ambari-server/
+ ps -o pid,ppid,pgrp -p 9415
 9415     1  9368
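
The same check can be scripted. The following sketch assumes Python 3 and relies only on {{os.getpgid()}} and {{os.getpgrp()}}; {{describe_process}} is an illustrative name, and the demo inspects two throwaway child processes rather than a real ambari-server PID.

```python
import os
import subprocess
import sys


def describe_process(pid):
    """Return the PID, its PGID, and how that PGID relates to the caller."""
    pgid = os.getpgid(pid)
    return {
        "pid": pid,
        "pgid": pgid,
        "is_group_leader": pgid == pid,
        "shares_caller_group": pgid == os.getpgrp(),
    }


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    inherited = subprocess.Popen(sleeper)  # stays in the caller's group
    detached = subprocess.Popen(sleeper, start_new_session=True)  # own group
    try:
        print("inherited:", describe_process(inherited.pid))
        print("detached: ", describe_process(detached.pid))
    finally:
        for child in (inherited, detached):
            child.kill()
            child.wait()
```

For the process shown above (PID 9415, PGRP 9368), such a check would report that it is not a group leader, because its PGID differs from its PID.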
