trafodion-dev mailing list archives

From Qifan Chen <qifan.c...@esgyn.com>
Subject Re: [Urgent Help] Trafodion Build Environment Problem
Date Tue, 08 Sep 2015 17:49:34 GMT
If there is a collision, the run-time stats data from two or more processes
will be mixed.
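
A minimal sketch of that collision (not Trafodion's actual code): assuming the
per-process stats array were indexed by pid modulo 65535, as suggested in the
quoted message below, pid 65536 from the stack trace would land in the same
slot as pid 1. The names kSlots, slotFor, collides, and StatsByPid are
hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical slot count, taken from the "modulus 65535" suggestion in this thread.
constexpr std::uint32_t kSlots = 65535;

// Array-style lookup: fold any pid into the range [0, kSlots).
std::uint32_t slotFor(std::uint32_t pid) { return pid % kSlots; }

// Returns true when two different pids would share one array slot,
// which is the case where their run-time stats would be mixed.
bool collides(std::uint32_t a, std::uint32_t b) {
    return a != b && slotFor(a) == slotFor(b);
}

// Hash-table-style lookup: the full pid is the key, so two distinct
// pids always get separate entries (at the cost of hashing on each access).
using StatsByPid = std::unordered_map<std::uint32_t, std::string>;
```

With these definitions, collides(65536, 1) is true, while a StatsByPid keyed by
65536 and 1 keeps two separate entries.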

On Tue, Sep 8, 2015 at 12:40 PM, Eric Owhadi <eric.owhadi@esgyn.com> wrote:

> Would it be a big problem to apply a modulus of 65535 to avoid this, rather
> than moving to a hash table and taking a performance hit?
> Eric
>
> -----Original Message-----
> From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
> Sent: Tuesday, September 8, 2015 12:27 PM
> To: dev@trafodion.incubator.apache.org
> Cc: Lijian (Q) <jianli.li@huawei.com>
> Subject: RE: [Urgent Help] Trafodion Build Environment Problem
>
> The whole Trafodion stack may not have been tested with pids greater than
> 65K. However, problems with pids above 65K will first be observed in the
> mxssmp or mxsscp processes, which dump core. These processes provide the
> capability to troubleshoot problems with query execution in the Trafodion
> infrastructure by providing real-time execution statistics. Every Trafodion
> SQL process is registered when it makes Trafodion SQL CLI calls and
> unregisters itself when it goes away. Internally, we use an array for this
> purpose for performance reasons.
>
> Selva
>
> -----Original Message-----
> From: Qifan Chen [mailto:qifan.chen@esgyn.com]
> Sent: Tuesday, September 8, 2015 10:03 AM
> To: dev <dev@trafodion.incubator.apache.org>
> Cc: Lijian (Q) <jianli.li@huawei.com>
> Subject: Re: [Urgent Help] Trafodion Build Environment Problem
>
> For pids larger than 65K, we probably can use a hash table.  Thanks --Qifan
>
> On Tue, Sep 8, 2015 at 11:27 AM, Hans Zeller <hans.zeller@esgyn.com>
> wrote:
>
> > Hi Nieyuanyuan,
> >
> > Some of us are also working on running Trafodion in a sandbox or on
> > Apache objects. We hope to have documented steps on how to do that
> > eventually. You mention you had to fix several things. If you have
> > notes on what those are, would you share them?
> >
> > Thank you,
> >
> > Hans
> >
> > On Tue, Sep 8, 2015 at 9:19 AM, Amanda Moran <amanda.moran@esgyn.com>
> > wrote:
> >
> > > Hi there-
> > >
> > > This is fixed in the latest version of the installer.
> > >
> > > Thanks.
> > >
> > > Sent from my iPhone
> > >
> > > > On Sep 8, 2015, at 9:07 AM, Dave Birdsall
> > > > <dave.birdsall@esgyn.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'm wondering if this should be reported as a problem? Perhaps
> > > > Nieyuanyuan would like to open a JIRA about supporting higher PID
> > > > numbers in Trafodion?
> > > >
> > > > Dave
> > > >
> > > > -----Original Message-----
> > > > From: Narendra Goyal [mailto:narendra.goyal@esgyn.com]
> > > > Sent: Monday, September 7, 2015 7:04 PM
> > > > To: dev@trafodion.incubator.apache.org
> > > > Cc: Lijian (Q) <jianli.li@huawei.com>
> > > > Subject: RE: [Urgent Help] Trafodion Build Environment Problem
> > > >
> > > > Hi Nieyuanyuan,
> > > >
> > > > Could you please check the 'pid_max' settings:
> > > > sysctl -q kernel.pid_max
> > > > (or cat /proc/sys/kernel/pid_max)
> > > >
> > > > If the value is > 64K, I would recommend you set it to 64K, like so:
> > > > sudo sysctl -w kernel.pid_max=65535
> > > >
> > > > You will have to restart Trafodion and the other Hadoop/HBase
> > > > processes:
> > > > swstopall
> > > > ckillall
> > > > swstartall
> > > > sqstart
> > > >
> > > > Just FYI, to list only the Trafodion processes, please run 'cstat'
> > > > in your bash shell.
> > > >
> > > > Thanks,
> > > > -Narendra
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Nieyuanyuan [mailto:nieyuanyuan@huawei.com]
> > > > Sent: Monday, September 7, 2015 6:40 PM
> > > > To: dev@trafodion.incubator.apache.org
> > > > Cc: Lijian (Q) <jianli.li@huawei.com>
> > > > Subject: [Urgent Help] Trafodion Build Environment Problem
> > > >
> > > > Dear Guys,
> > > >
> > > > I recently downloaded Trafodion 1.1 from
> > > > https://github.com/apache/incubator-trafodion/tree/stable/1.1 and
> > > > followed the build guide at
> > > > https://wiki.trafodion.org/wiki/index.php/Building_the_Software.
> > > > After solving a lot of problems (no need to list all the details),
> > > > I am able to run Trafodion in a Hadoop sandbox environment.
> > > >
> > > > But I have a serious problem: all Trafodion-related processes go
> > > > down after several minutes (not sure how long), and only a few of
> > > > them are left:
> > > > [nieyy@redhat-72 ~]$ ps ux
> > > > USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > > > nieyy     76554  0.1  0.1 590988 139768 pts/6   Sl   19:14   0:04 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > > nieyy    118833  0.7  0.3 1535452 420996 ?      Sl   19:40   0:12 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_namenode -Xmx1000m -Djava.net.prefe
> > > > nieyy    119085  0.6  0.2 1572688 367388 ?      Sl   19:40   0:10 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_datanode -Xmx1000m -Djava.net.prefe
> > > > nieyy    119320  0.4  0.2 1512656 340636 ?      Sl   19:41   0:07 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_secondarynamenode -Xmx1000m -Djava.
> > > > nieyy    119972  1.2  0.2 1708408 378536 pts/6  Sl   19:41   0:20 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.
> > > > nieyy    120133  0.9  0.2 1616388 309976 ?      Sl   19:41   0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.
> > > > nieyy    120371  0.0  0.0   9824  1772 pts/6    S    19:41   0:00 /bin/sh ./bin/mysqld_safe --defaults-file=/home/nieyy/trafodion_build/incubator-trafodion-stable-1.
> > > > nieyy    120594  0.0  0.0 452604 89908 pts/6    Sl   19:41   0:01 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/mysql/bin/mysq
> > > > nieyy    120789  0.0  0.0   9692  1736 pts/6    S    19:41   0:00 bash /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/hbase/bin
> > > > nieyy    120806  2.0  0.3 1809048 509164 pts/6  Sl   19:41   0:34 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill
> > > > nieyy    122554  0.0  0.0  13624  1304 pts/6    S    19:41   0:00 mpirun -disable-auto-cleanup -demux select -env SQ_IC TCP -env MPI_ERROR_LEVEL 2 -env SQ_PIDMAP 1 -
> > > > nieyy    122555  0.0  0.0      0     0 ?        Zs   19:41   0:00 [hydra_pmi_proxy] <defunct>
> > > > nieyy    122556  1.0  0.0 335212 36748 ?        Ssl  19:41   0:17 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > > nieyy    122557  0.8  0.0 335212 36768 ?        Ssl  19:41   0:14 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > > nieyy    123946  0.9  0.1 828072 223088 pts/6   Sl   19:42   0:14 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > > nieyy    124044  1.0  0.1 629200 187180 pts/6   Sl   19:42   0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > >
> > > > I then need to kill all processes and use swstartall and sqstart to
> > > > reset the environment. However, the environment still goes down
> > > > after a while, and I need to restart again.
> > > >
> > > > I found some cores under
> > > > trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/scripts;
> > > > all cores were generated by mxssmp:
> > > > [nieyy@redhat-72 scripts]$ ll core* ...
> > > > -rw------- 1 nieyy nieyy 156008448 Sep  7 17:56 core.mxssmp.173357
> > > > -rw------- 1 nieyy nieyy 145518592 Sep  7 17:56 core.mxssmp.173372
> > > > -rw------- 1 nieyy nieyy 156008448 Sep  7 19:24 core.mxssmp.74146
> > > > -rw------- 1 nieyy nieyy 145518592 Sep  7 19:24 core.mxssmp.74197
> > > >
> > > > I used gdb to track the stack:
> > > > [nieyy@redhat-72 scripts]$ gdb /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sql/lib/linux/64bit/debug/mxssmp ./core.mxssmp.141469 ...
> > > > (gdb) where
> > > > #0  0x000000000044166c in ProcessStats::getHeap (this=0x2000) at ../runtimestats/SqlStats.h:271
> > > > #1  0x000000000043990a in StatsGlobals::removeProcess (this=0x10000000, pid=65536, calledAtAdd=0) at ../runtimestats/SqlStats.cpp:276
> > > > #2  0x0000000000439e05 in StatsGlobals::checkForDeadProcesses (this=0x10000000, myPid=141469) at ../runtimestats/SqlStats.cpp:382
> > > > #3  0x00000000004440be in SsmpGlobals::work (this=0x7f062660c7e8) at ../runtimestats/ssmpipc.cpp:582
> > > > #4  0x000000000042f06a in runServer (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:259
> > > > #5  0x000000000042eb12 in main (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:127
> > > >
> > > > Then I searched via Google and found
> > > > https://bugs.launchpad.net/trafodion/+bug/1368891, which looks
> > > > similar. It claims the bug was fixed in v0.9, but my version is 1.1.
> > > >
> > > > So could you kindly help me solve this problem? I can't find more
> > > > useful information via Google.
> > > >
> > > > Thanks a lot.
> > >
> >
>
>
>
> --
> Regards, --Qifan
>



-- 
Regards, --Qifan
