trafodion-dev mailing list archives

From Selva Govindarajan <selva.govindara...@esgyn.com>
Subject Re: Trafodion Maser daily build failures
Date Tue, 21 Mar 2017 19:24:56 GMT
Thanks Arvind and Steve for following up. I had said RMS uses the port number. Actually, the
segment id is obtained from the foundation layer and used in the semaphore name.


SEG_ID getStatsSegmentId()
{
  Int32 segid;
  Int32 error;
  if (gStatsSegmentId_ == -1)
  {
    error = msg_mon_get_my_segid(&segid);
    assert(error == 0); // XZFIL_ERR_OK
    gStatsSegmentId_ = segid + RMS_SEGMENT_ID_OFFSET;
  }
  return gStatsSegmentId_;
}

RMS gets it once and stores the created semaphore name for later use. I think the process ID can
also be used in the monitor's case, because the semaphore is valid only as long as the monitor
is alive. In the case of RMS, the semaphore name needs to remain the same even if RMS processes
are restarted, as long as the node is up.
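
As a rough illustration of the naming described above, here is a minimal sketch. The "/rms.<uid>.<segid>" layout is only inferred from the sem.rms.1003.268468606 files listed later in this thread, and both function names are hypothetical, not taken from the Trafodion source.

```cpp
#include <string>
#include <sys/types.h>

// Hypothetical helper: builds a per-user, per-segment semaphore name.
// The layout is an assumption based on the /dev/shm listing in this thread.
std::string rmsSemaphoreName(uid_t uid, int statsSegmentId)
{
    return "/rms." + std::to_string(uid) + "." + std::to_string(statsSegmentId);
}

// On Linux, glibc backs a named semaphore "/name" with the file
// /dev/shm/sem.name, which is why leftovers show up in that directory.
std::string semaphoreBackingFile(const std::string& semName)
{
    std::string bare = semName;
    if (!bare.empty() && bare[0] == '/')
        bare.erase(0, 1);
    return "/dev/shm/sem." + bare;
}
```

Because the name depends only on the uid and the segment id, it stays the same across RMS process restarts on a node, which is the property described above.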

Selva


________________________________
From: Arvind N <narain.arvind@gmail.com>
Sent: Tuesday, March 21, 2017 12:03:22 PM
To: dev@trafodion.incubator.apache.org
Cc: Steve Varnau; Selva Govindarajan
Subject: RE: Trafodion Maser daily build failures

Steve modified the scripts to print out the contents of /dev/shm before
install and after uninstall. The output below suggests that the cause is a
leftover semaphore in /dev/shm from a previous build.

The failures are restricted to the hdp environment. They happen when the
slave system was first used by a daily build for Release 2.1 (which leaves
files in /dev/shm for id 1003) and is then used for the daily build for
master. Maybe the logic for finding the next available id differs between
the py installer and the bash installer?

Suggestion from Selva to attach process ID to the semaphore name should
clear this problem.
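
A minimal sketch of that suggestion, assuming a name of the form "/monitor.sem.<user>.<pid>". The function names, the 0644 mode, and the initial value of 1 are illustrative assumptions, not the monitor's actual code.

```cpp
#include <fcntl.h>
#include <semaphore.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical naming scheme: the pid makes the name unique per monitor run.
std::string monitorSemaphoreName(const std::string& user, pid_t pid)
{
    return "/monitor.sem." + user + "." + std::to_string(pid);
}

// Creates the semaphore exclusively. Returns SEM_FAILED with errno set on
// failure. O_EXCL turns a name collision into EEXIST instead of silently
// reusing a semaphore left behind by another run.
sem_t* createMonitorSemaphore(const std::string& name)
{
    return sem_open(name.c_str(), O_CREAT | O_EXCL, 0644, 1);
}

// Closes the handle and removes the name so nothing stays in /dev/shm.
void destroyMonitorSemaphore(sem_t* sem, const std::string& name)
{
    if (sem != SEM_FAILED)
        sem_close(sem);
    sem_unlink(name.c_str());
}
```

With a scheme like this, a semaphore file left behind by a crashed or differently owned monitor would no longer block the next run, although stale files could still accumulate in /dev/shm unless something removes them.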


                From master daily build:


https://jenkins.esgyn.com/job/core-regress-privs1-hdp/505/console

                 AHW 2.3 (i-014c7dcfa0719ec26)

                2017-03-21 09:18:58 === Tue Mar 21 09:18:58 UTC 2017: /usr/local/bin/install-traf.sh
                2017-03-21 09:18:58 === Setting up Trafodion
                2017-03-21 09:18:58 ========================================================
                2017-03-21 09:18:58 Source /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/core/sqf/conf/install_features
                2017-03-21 09:18:58 Java for Trafodion install: /usr/lib/jvm/java-1.7.0-openjdk.x86_64
                2017-03-21 09:18:58 Saving output in Install_Start.log
                2017-03-21 09:18:58 + chmod o+r /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion_installer-2.2.0-incubating.tar.gz /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion_server-2.2.0-RH6-x86_64-incubating.tar.gz /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion-regress.tgz
                2017-03-21 09:18:58 + echo 'Checking shared mem'
                2017-03-21 09:18:58 Checking shared mem
                2017-03-21 09:18:58 + ls -ld /dev/shm
                2017-03-21 09:18:58 drwxrwxrwt 2 root root 100 Mar 21 09:18 /dev/shm
                2017-03-21 09:18:58 + ls -l /dev/shm
                2017-03-21 09:18:58 total 12
                2017-03-21 09:18:58 -rw-r--r-- 1 1003 509 32 Mar 21 08:03 sem.monitor.sem.trafodion
                2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268468606
                2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268490614
                2017-03-21 09:18:58 + echo ============
                2017-03-21 09:18:58 ============

                Leftover from the Release 2.1 build:


https://jenkins.esgyn.com/job/phoenix_part2_T4-hdp/580/consoleFull - 2.1 build

                2017-03-21 09:16:05 *********************************
                2017-03-21 09:16:05   Trafodion Uninstall Completed
                2017-03-21 09:16:05 *********************************
                2017-03-21 09:16:05 + uninst_ret=0
                2017-03-21 09:16:05 + sudo rm -f /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
                2017-03-21 09:16:05 + sudo mv /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run.save /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
                2017-03-21 09:16:05 + sudo chmod -R a+rX /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
                2017-03-21 09:16:05 + exit 0
                2017-03-21 09:16:05 + rc=0
                2017-03-21 09:16:05 + echo 'Checking shared mem'
                2017-03-21 09:16:05 Checking shared mem
                2017-03-21 09:16:05 + ls -ld /dev/shm
                2017-03-21 09:16:05 drwxrwxrwt 2 root root 100 Mar 21 09:15 /dev/shm
                2017-03-21 09:16:05 + ls -l /dev/shm
                2017-03-21 09:16:05 total 12
                2017-03-21 09:16:05 -rw-r--r-- 1 1003 509 32 Mar 21 08:03 sem.monitor.sem.trafodion
                2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268468606
                2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268490614
                2017-03-21 09:16:05 + echo ============
                2017-03-21 09:16:05 ============
                2017-03-21 09:16:05 + exit 0
                2017-03-21 09:16:05 + exit 0


Regards
Arvind

-----Original Message-----
From: Narendra Goyal [mailto:narendra.goyal@esgyn.com]
Sent: Friday, March 17, 2017 2:39 PM
To: dev@trafodion.incubator.apache.org
Subject: RE: Trafodion Maser daily build failures

I checked the /dev/shm directory on the build machine and it was empty. I
was able to create a file /dev/shm/foo (as the 'trafodion' user id), so it
does not look like a permissions issue (on /dev/shm at least).

I am not sure whether any build has happened on that build machine, but I
do not see any orphan semaphore in /dev/shm.

Thanks,
-Narendra

-----Original Message-----
From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
Sent: Friday, March 17, 2017 11:07 AM
To: dev@trafodion.incubator.apache.org
Subject: Trafodion Maser daily build failures

First, I changed the subject line so that this message doesn't get filtered
out. The Trafodion master daily build has been failing randomly with the
following stack trace in the monitor.


(gdb) bt
#0  0x00007feaee0eb625 in raise () from /lib64/libc.so.6
#1  0x00007feaee0ece05 in abort () from /lib64/libc.so.6
#2  0x000000000041f8b3 in CProcessContainer::CProcessContainer (this=0x270e340, nodeContainer=<value optimized out>) at process.cxx:3389
#3  0x00000000004569cc in CNode::CNode (this=0x270e340, name=0x26e9548 "slave-ahw23", pnid=0, rank=0) at pnode.cxx:152
#4  0x0000000000458050 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1572
#5  0x0000000000419185 in CCluster::InitializeConfigCluster (this=0x2712270) at cluster.cxx:2818
#6  0x0000000000419e25 in CCluster::CCluster (this=0x2712270) at cluster.cxx:597
#7  0x000000000043473a in CTmSync_Container::CTmSync_Container (this=0x2712270) at tmsync.cxx:137
#8  0x0000000000408f36 in CMonitor::CMonitor (this=0x2712270, procTermSig=9) at monitor.cxx:329
#9  0x000000000040a5ab in main (argc=2, argv=0x7ffd157c0b48) at monitor.cxx:1308
(gdb) f 2

The monitor log shows
2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101020103, [CMonitor::main], monitor Version 1.0.1 prodver Release 2.2.0 (Build release [2.0.1rc3-1425-g6155ff1_Bld883], branch 6155ff1_no_branch, date 20170316_0832), Started! CommType: Sockets
2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101010401, [CCluster::CCluster] Validation of node down is disabled
2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101030703, [CProcessContainer::CProcessContainer], Can't create semaphore /monitor.sem.trafodion! (Permission denied)
2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101030704, [CProcessContainer::CProcessContainer], Can't unlink semaphore /monitor.sem.trafodion! (Permission denied)

I came up with the following theory.

When a semaphore is created, a file with the given semaphore name is
created in /dev/shm by the process. The process owner needs write
permission to create this file. Initially I suspected a permission issue
on the /dev/shm directory.

I just looked at /dev/shm in the Jenkins VM. It did have write permission.

If that's the case, it is possible the previous semaphore was not cleaned
up correctly. The monitor seems to create the semaphore as
/dev/shm/sem.monitor.sem.<user_name>. If trafodion gets a different uid
between two runs, it may be unable to clean it up. In the case of RMS, we
attach the port number to the semaphore name so that every run from the
same user name will get a different semaphore name.
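
One way to test this theory is to look for semaphore files in /dev/shm that belong to a different uid. The sketch below is a hypothetical helper, not part of Trafodion. It assumes the Linux convention that named semaphores appear as sem.* files in that directory.

```cpp
#include <dirent.h>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <vector>

// Returns the names of sem.* files in `dir` whose owner differs from `uid`.
// A missing or unreadable directory yields an empty list.
std::vector<std::string> foreignSemaphoreFiles(const std::string& dir, uid_t uid)
{
    std::vector<std::string> found;
    DIR* d = opendir(dir.c_str());
    if (d == nullptr)
        return found;
    while (dirent* entry = readdir(d)) {
        std::string file = entry->d_name;
        if (file.compare(0, 4, "sem.") != 0)
            continue; // not a semaphore backing file
        struct stat st;
        if (stat((dir + "/" + file).c_str(), &st) == 0 && st.st_uid != uid)
            found.push_back(file);
    }
    closedir(d);
    return found;
}
```

Calling foreignSemaphoreFiles("/dev/shm", geteuid()) before starting the monitor would list any leftover semaphore files that the current user does not own and so probably cannot reopen or unlink.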

---------------------

The sem_open documentation says:

EACCES The semaphore exists, but the caller does not have permission
              to open it

EACCES is 13, the errno value seen in gdb.

Please help resolve this issue if you have any other ideas.

Selva

