Date: Tue, 16 Sep 2014 15:11:51 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11045) Introducing a tool to detect flaky tests of hadoop jenkins test job

    [ https://issues.apache.org/jira/browse/HADOOP-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135581#comment-14135581 ]

Yongjun Zhang commented on HADOOP-11045:
----------------------------------------

I checked PreCommit-HDFS-Build, and here is the result. The tool reports testPipelineRecoveryStress as the most frequently failing test (HDFS-6694); until it is fixed, its failures might hide some real problems.

The second and third tests in the list below failed for a similar reason: "Too many open files...". This is suspicious because it was not the case before. Some recent code change might have introduced the problem (I just filed HDFS-7070).

{code}
****Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build
THERE ARE 18 builds (out of 20) that have failed tests in the past 3 days, as listed below:
......
Among 20 runs examined, all failed tests <#failedRuns: testName>:
    8: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testPipelineRecoveryStress
    6: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testResponseCode
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testRenameDirToSelf
    2: org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth
    2: org.apache.hadoop.fs.contract.localfs.TestLocalFSContractOpen.testFsIsEncrypted
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testOverWriteAndRead
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testOutputStreamClosedTwice
    2: org.apache.hadoop.fs.contract.rawlocal.TestRawlocalContractOpen.testFsIsEncrypted
    2: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer.testStored
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testSeek
    1: org.apache.hadoop.hdfs.TestDFSShell.testGet
    1: org.apache.hadoop.hdfs.TestDFSUpgrade.testUpgrade
    1: org.apache.hadoop.fs.TestFsShellCopy.testCopyNoCrc
    1: org.apache.hadoop.crypto.key.TestValueQueue.testgetAtMostPolicyALL
    1: org.apache.hadoop.hdfs.TestDFSShell.testCopyToLocal
......
{code}

> Introducing a tool to detect flaky tests of hadoop jenkins test job
> -------------------------------------------------------------------
>
>                 Key: HADOOP-11045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11045
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build, tools
>    Affects Versions: 2.5.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-11045.001.patch, HADOOP-11045.002.patch
>
>
> Filing this jira to introduce a tool to detect flaky tests of hadoop jenkins test jobs. Certainly it can be adapted to projects other than hadoop.
> I developed the tool on top of some initial work [~tlipcon] did. We find it quite useful. With Todd's agreement, I'd like to push it upstream so all of us can share it (thanks Todd for the initial work and support). I hope you find the tool useful too.
> The idea is, when one needs to see whether the test failure s/he is seeing in a pre-build jenkins run is flaky or not, s/he could run this tool to get a good idea. Also, if one wants to look at the failure trend of a testcase in a given jenkins job, the tool can be used too. I hope people find it useful.
> This tool is for hadoop contributors rather than hadoop users. Thanks [~tedyu] for the advice to put it in the dev-support dir.
> Description of the tool:
> {code}
> #
> # Given a jenkins test job, this script examines all runs of the job done
> # within a specified period of time (number of days prior to the execution
> # time of this script), and reports all failed tests.
> #
> # The output of this script includes a section for each run that has failed
> # tests, with each failed test name listed.
> #
> # More importantly, at the end, it outputs a summary section that lists all failed
> # tests within all examined runs, indicates how many runs each test
> # failed in, and sorts the failed tests by that count.
> #
> # This way, when we see failed tests in a PreCommit build, we can quickly tell
> # whether a failed test is a new failure or has failed before and may just
> # be a flaky test.
> #
> # Of course, to be 100% sure about the reason for a failed test, a closer look
> # at the failed test in the specific run is necessary.
> #
> {code}
> How to use the tool:
> {code}
> Usage: determine-flaky-tests-hadoop.py [options]
> Options:
>   -h, --help            show this help message and exit
>   -J JENKINS_URL, --jenkins-url=JENKINS_URL
>                         Jenkins URL
>   -j JOB_NAME, --job-name=JOB_NAME
>                         Job name to look at
>   -n NUM_PREV_DAYS, --num-days=NUM_PREV_DAYS
>                         Number of days to examine
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
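
For illustration only, the summary step in the quoted tool description (count how many runs each test failed in, then sort by that count) could be sketched as below. This is a minimal sketch, not the code in the attached patches: the function names `summarize_failures` and `format_summary`, the build numbers, and the test names are all hypothetical, and fetching results from Jenkins is left out entirely.

```python
from collections import Counter


def summarize_failures(failed_by_run):
    """Return (test_name, failed_run_count) pairs, most frequent first.

    failed_by_run is a hypothetical mapping of build number to the list of
    test names that failed in that build.
    """
    counts = Counter()
    for tests in failed_by_run.values():
        # Count a test once per run, even if it is listed twice in that run.
        counts.update(set(tests))
    # Sort by descending count; break ties by name so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def format_summary(summary):
    """Render the pairs in the '<#failedRuns: testName>' style shown above."""
    return ["%d: %s" % (count, name) for name, count in summary]


# Hypothetical sample data, not taken from any real Jenkins job.
sample_runs = {
    8001: ["TestA.testX", "TestB.testY"],
    8002: ["TestA.testX"],
    8003: [],
    8004: ["TestA.testX", "TestB.testY", "TestC.testZ"],
}

for line in format_summary(summarize_failures(sample_runs)):
    print(line)
```

A test that appears near the top of such a list across many unrelated runs is a candidate flaky test, while one that appears only in the run under review is more likely a new failure.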