trafodion-dev mailing list archives

From Steve Varnau <steve.var...@esgyn.com>
Subject RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing
Date Fri, 27 May 2016 14:56:48 GMT

Wow, good find.  I thought 9000 ephemeral ports would be plenty, but
apparently not.  Or perhaps whatever is requesting ports for the compressor
makes some assumption about the range and asks for a specific port or range.
Either way, I’ll need to go back and reserve the specific ports that cause us
problems, rather than making the range much smaller.
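
Roughly, the two options would look something like this on the VM image (just
a sketch; the reserved ports listed below are only illustrative Hadoop/Cloudera
defaults, not a final list):

  # what we did: shrink the ephemeral range the kernel hands out
  sysctl -w net.ipv4.ip_local_port_range="51000 59999"

  # alternative: keep the default range, but reserve the specific listener
  # ports (illustrative Hadoop defaults shown) so the kernel never hands
  # them out as ephemeral client ports
  sysctl -w net.ipv4.ip_local_reserved_ports="50010,50020,50070,50075"

  # either setting would go in /etc/sysctl.conf on the image to persist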



--Steve



From: Arvind [mailto:narain.arvind@gmail.com]
Sent: Thursday, May 26, 2016 10:27 PM
To: dev@trafodion.incubator.apache.org; Steve Varnau <steve.varnau@esgyn.com>
Subject: RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing



Hi Steve,

It does seem to be related to ephemeral ports and/or TCP timeout settings
(sockets sitting in TIME_WAIT for 2 minutes or something similar).

Other logs might indicate how many socket opens are being done for this
test, but the following log files (the .3 file in particular) show that we
most probably ran out of ephemeral ports, which matches what you were
suspecting.

http://traf-testlogs.esgyn.com/Requested/57/regress-cm5.4/traf_run/logs/trafodion.hdfs.log.3

http://traf-testlogs.esgyn.com/Requested/57/regress-cm5.4/traf_run/logs/trafodion.hdfs.log.2

http://traf-testlogs.esgyn.com/Requested/57/regress-cm5.4/traf_run/logs/trafodion.hdfs.log.1

http://traf-testlogs.esgyn.com/Requested/57/regress-cm5.4/traf_run/logs/trafodion.hdfs.log

2016-05-26 17:34:29,178 INFO compress.CodecPool: Got brand-new compressor [.gz]
2016-05-26 17:36:11,728 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.BindException: Cannot assign requested address
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:484)
        at sun.nio.ch.Net.connect(Net.java:476)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:675)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1622)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1420)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
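
If it helps next time this reproduces, something along these lines on the test
VM should confirm whether the ephemeral range is exhausted (just a sketch; I
have not run this against the failing run):

  # range of ports the kernel may hand out for outgoing connections
  cat /proc/sys/net/ipv4/ip_local_port_range

  # number of sockets sitting in TIME_WAIT; each one pins an ephemeral
  # port for roughly 2 minutes after the connection closes
  ss -tan | grep -c TIME-WAIT

  # same check with netstat, if ss is not available
  netstat -tan | grep -c TIME_WAIT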

Regards

Arvind

-----Original Message-----
From: Steve Varnau [mailto:steve.varnau@esgyn.com]
Sent: Thursday, May 26, 2016 12:12 PM
To: dev@trafodion.incubator.apache.org
Subject: RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing

I tested out this theory: I ran PR502 (Selva's memory fix for TEST018)
against the hive test, and it still fails:

https://jenkins.esgyn.com/job/Requested-Test/57/

Then I changed the jenkins config to use the previous VM image and ran it
again, and it passed:

https://jenkins.esgyn.com/job/Requested-Test/59/

The only intentional change between those VM images was limiting the range
of ephemeral ports.

Perhaps some unintentional change also got in; otherwise I'm stumped as to
how that change would cause this problem.

--Steve



> -----Original Message-----
> From: Steve Varnau [mailto:steve.varnau@esgyn.com]
> Sent: Thursday, May 26, 2016 9:11 AM
> To: 'dev@trafodion.incubator.apache.org' <dev@trafodion.incubator.apache.org>
> Subject: RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing
>
> I think the error usually looks like that, or more often it hangs and the
> test times out.
>
> The odd thing is that it started failing on both branches on the same day.
> There were changes on the master branch, but none on the release2.0 branch.
> That is what makes me think the trigger was environmental rather than a
> code change.
>
> I guess I could switch jenkins back to using the previous VM image to see
> if it goes away.
>
> --Steve
>

> > -----Original Message-----
> > From: Sandhya Sundaresan [mailto:sandhya.sundaresan@esgyn.com]
> > Sent: Thursday, May 26, 2016 9:04 AM
> > To: dev@trafodion.incubator.apache.org
> > Subject: RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing
> >

> > Hi Steve,

> >

> > The error today is this:
> >
> > *** ERROR[8448] Unable to access Hbase interface. Call to
> > ExpHbaseInterface::scanOpen returned error HBASE_OPEN_ERROR(-704). Cause:
> > java.lang.Exception: Cannot create Table Snapshot Scanner
> > org.TRAFODION.sql.HTableClient.startScan(HTableClient.java:1003)
> >
> > We have seen this in the past when there is Java memory pressure.
> >
> > A few days back this same snapshot scan creation failed with the error
> > below.  I wonder if anyone can see a pattern here or knows the causes of
> > either of these.

> >

> > >>--snapshot
> > >>execute snp;
> >
> > *** ERROR[8448] Unable to access Hbase interface. Call to
> > ExpHbaseInterface::scanOpen returned error HBASE_OPEN_ERROR(-704). Cause:
> > java.io.IOException: java.util.concurrent.ExecutionException:
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > /bulkload/20160520102824/TRAFODION.HBASE.CUSTOMER_ADDRESS_SNAP111/6695c6f9-4bb5-4ad5-893b-adf07fc8a4b9/data/default/TRAFODION.HBASE.CUSTOMER_ADDRESS/7143c21b40a7bef21768685f7dc18e1c/.regioninfo
> > could only be replicated to 0 nodes instead of minReplication (=1).
> > There are 1 datanode(s) running and no node(s) are excluded in this operation.
> >         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1541)
> >         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3289)
> >         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:668)
> >         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:212)
> >         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:483)
> >         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
> >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
> >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:415)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
> >
> > org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:162)
> > org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.cloneHdfsRegions(RestoreSnapshotHelper.java:561)
> > org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.restoreHdfsRegions(RestoreSnapshotHelper.java:237)
> > org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.restoreHdfsRegions(RestoreSnapshotHelper.java:159)
> > org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.copySnapshotForScanner(RestoreSnapshotHelper.java:812)
> > org.apache.hadoop.hbase.client.TableSnapshotScanner.init(TableSnapshotScanner.java:156)
> > org.apache.hadoop.hbase.client.TableSnapshotScanner.<init>(TableSnapshotScanner.java:124)
> > org.apache.hadoop.hbase.client.TableSnapshotScanner.<init>(TableSnapshotScanner.java:101)
> > org.trafodion.sql.HTableClient$SnapshotScanHelper.createTableSnapshotScanner(HTableClient.java:222)
> > org.trafodion.sql.HTableClient.startScan(HTableClient.java:1009)
> > .
> >
> > --- 0 row(s) selected.
> >
> > >>log;
> >
> > Sandhya

> >

> > -----Original Message-----
> > From: Steve Varnau [mailto:steve.varnau@esgyn.com]
> > Sent: Thursday, May 26, 2016 8:49 AM
> > To: dev@trafodion.incubator.apache.org
> > Subject: RE: Trafodion release2.0 Daily Test Result - 23 - Still Failing

> >

> > This hive regression behavior is still puzzling; however, I just
> > realized one thing that did change just before it started failing, and
> > it is a test environment change common to both branches.  The VM image
> > for cloudera was updated to set a smaller ephemeral port range, to
> > reduce the chance of port conflicts that were occasionally impacting
> > HBase.

> >

> > The range was set to 51000 - 59999, to avoid the default port numbers
> > that the Cloudera distro uses.

> >

> > So how could this possibly be causing disaster in hive/TEST018?  I have
> > no idea.

> >

> > --Steve

> >

> > > -----Original Message-----
> > > From: steve.varnau@esgyn.com [mailto:steve.varnau@esgyn.com]
> > > Sent: Thursday, May 26, 2016 1:36 AM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: Trafodion release2.0 Daily Test Result - 23 - Still Failing
> > >
> > > Daily Automated Testing release2.0
> > >
> > > Jenkins Job:   https://jenkins.esgyn.com/job/Check-Daily-release2.0/23/
> > > Archived Logs: http://traf-testlogs.esgyn.com/Daily-release2.0/23
> > > Bld Downloads: http://traf-builds.esgyn.com
> > >
> > > Changes since previous daily build:
> > > No changes
> > >
> > > Test Job Results:
> > >
> > > FAILURE core-regress-hive-cdh (55 min)
> > > SUCCESS build-release2.0-debug (24 min)
> > > SUCCESS build-release2.0-release (28 min)
> > > SUCCESS core-regress-charsets-cdh (28 min)
> > > SUCCESS core-regress-charsets-hdp (41 min)
> > > SUCCESS core-regress-compGeneral-cdh (36 min)
> > > SUCCESS core-regress-compGeneral-hdp (45 min)
> > > SUCCESS core-regress-core-cdh (39 min)
> > > SUCCESS core-regress-core-hdp (1 hr 10 min)
> > > SUCCESS core-regress-executor-cdh (56 min)
> > > SUCCESS core-regress-executor-hdp (1 hr 25 min)
> > > SUCCESS core-regress-fullstack2-cdh (13 min)
> > > SUCCESS core-regress-fullstack2-hdp (14 min)
> > > SUCCESS core-regress-hive-hdp (53 min)
> > > SUCCESS core-regress-privs1-cdh (39 min)
> > > SUCCESS core-regress-privs1-hdp (59 min)
> > > SUCCESS core-regress-privs2-cdh (41 min)
> > > SUCCESS core-regress-privs2-hdp (54 min)
> > > SUCCESS core-regress-qat-cdh (16 min)
> > > SUCCESS core-regress-qat-hdp (21 min)
> > > SUCCESS core-regress-seabase-cdh (57 min)
> > > SUCCESS core-regress-seabase-hdp (1 hr 16 min)
> > > SUCCESS core-regress-udr-cdh (28 min)
> > > SUCCESS core-regress-udr-hdp (31 min)
> > > SUCCESS jdbc_test-cdh (22 min)
> > > SUCCESS jdbc_test-hdp (40 min)
> > > SUCCESS phoenix_part1_T2-cdh (56 min)
> > > SUCCESS phoenix_part1_T2-hdp (1 hr 17 min)
> > > SUCCESS phoenix_part1_T4-cdh (46 min)
> > > SUCCESS phoenix_part1_T4-hdp (57 min)
> > > SUCCESS phoenix_part2_T2-cdh (53 min)
> > > SUCCESS phoenix_part2_T2-hdp (1 hr 25 min)
> > > SUCCESS phoenix_part2_T4-cdh (44 min)
> > > SUCCESS phoenix_part2_T4-hdp (1 hr 0 min)
> > > SUCCESS pyodbc_test-cdh (11 min)
> > > SUCCESS pyodbc_test-hdp (23 min)
