hadoop-common-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11045) Introducing a tool to detect flaky tests of hadoop jenkins test job
Date Tue, 16 Sep 2014 15:11:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135581#comment-14135581 ]

Yongjun Zhang commented on HADOOP-11045:

I checked PreCommit-HDFS-Build, and here is the result. The most frequent failure is
testPipelineRecoveryStress (HDFS-6694); until it is fixed, it may mask real problems.

The second and third tests in the list below failed for a similar reason, "Too many open
files...". This is suspicious because it did not happen before. A recent code change may have
introduced the problem (just filed HDFS-7070).

****Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build
    THERE ARE 18 builds (out of 20) that have failed tests in the past 3 days, as listed below:
Among 20 runs examined, all failed tests <#failedRuns: testName>:
    8: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testPipelineRecoveryStress
    6: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testResponseCode
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testRenameDirToSelf
    2: org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth
    2: org.apache.hadoop.fs.contract.localfs.TestLocalFSContractOpen.testFsIsEncrypted
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testOverWriteAndRead
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testOutputStreamClosedTwice
    2: org.apache.hadoop.fs.contract.rawlocal.TestRawlocalContractOpen.testFsIsEncrypted
    2: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer.testStored
    2: org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract.testSeek
    1: org.apache.hadoop.hdfs.TestDFSShell.testGet
    1: org.apache.hadoop.hdfs.TestDFSUpgrade.testUpgrade
    1: org.apache.hadoop.fs.TestFsShellCopy.testCopyNoCrc
    1: org.apache.hadoop.crypto.key.TestValueQueue.testgetAtMostPolicyALL
    1: org.apache.hadoop.hdfs.TestDFSShell.testCopyToLocal
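
For readers who want to see how a summary like the one above can be derived, here is a minimal Python sketch. It is not the code from the attached patch: the function names, the input shape (one list of failed test names per run), the sample test names, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


def summarize_failures(failed_tests_per_run):
    """Count, for each test, how many runs it failed in.

    failed_tests_per_run: one collection of failed test names per examined
    run (hypothetical input shape; the real tool collects this from Jenkins).
    Returns (test_name, failed_run_count) pairs, most frequent first.
    """
    counts = Counter()
    for failed_tests in failed_tests_per_run:
        # set() makes sure a test is counted at most once per run.
        counts.update(set(failed_tests))
    # Sort by descending run count; ties are broken by name (an assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_summary(num_runs, summary):
    """Render the summary in the '<#failedRuns: testName>' layout shown above."""
    lines = ["Among %d runs examined, all failed tests <#failedRuns: testName>:" % num_runs]
    for name, count in summary:
        lines.append("    %d: %s" % (count, name))
    return "\n".join(lines)


# Made-up sample data: four runs, one of them with no failures.
runs = [
    ["TestA.testX", "TestB.testY"],
    ["TestA.testX"],
    [],
    ["TestB.testY", "TestA.testX", "TestA.testX"],
]
summary = summarize_failures(runs)
print(format_summary(len(runs), summary))
```

A test that appears near the top of such a list across many unrelated runs is a candidate flaky test, while a test that shows up only in one's own run deserves a closer look.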

> Introducing a tool to detect flaky tests of hadoop jenkins test job
> -------------------------------------------------------------------
>                 Key: HADOOP-11045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11045
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build, tools
>    Affects Versions: 2.5.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-11045.001.patch, HADOOP-11045.002.patch
> This JIRA introduces a tool to detect flaky tests in Hadoop Jenkins test jobs. It can
> certainly be adapted to projects other than Hadoop.
> I developed the tool on top of some initial work [~tlipcon] did. We find it quite useful.
> With Todd's agreement, I'd like to push it upstream so all of us can share it (thanks Todd
> for the initial work and support). I hope you find the tool useful too.
> The idea is that when someone needs to know whether a test failure seen in a pre-build
> Jenkins run is flaky, they can run this tool to get a good idea. The tool can also be used
> to look at the failure trend of a test case in a given Jenkins job.
> This tool is for Hadoop contributors rather than Hadoop users. Thanks [~tedyu] for the
> advice to put it in the dev-support dir.
> Description of the tool:
> {code}
> #
> # Given a Jenkins test job, this script examines all runs of the job done
> # within a specified period of time (number of days prior to the execution
> # time of this script), and reports all failed tests.
> #
> # The output of this script includes a section for each run that has failed
> # tests, with each failed test name listed.
> #
> # More importantly, at the end, it outputs a summary section that lists all
> # failed tests within all examined runs, indicates how many runs each test
> # failed in, and sorts the failed tests by that count.
> #
> # This way, when we see failed tests in a PreCommit build, we can quickly tell
> # whether a failed test is a new failure or has failed before, in which case
> # it may just be a flaky test.
> #
> # Of course, to be 100% sure about the reason for a failed test, a closer look
> # at the failed test for the specific run is necessary.
> #
> {code}
> How to use the tool:
> {code}
> Usage: determine-flaky-tests-hadoop.py [options]
> Options:
>   -h, --help            show this help message and exit
>   -J JENKINS_URL, --jenkins-url=JENKINS_URL
>                         Jenkins URL
>   -j JOB_NAME, --job-name=JOB_NAME
>                         Job name to look at
>   -n NUM_PREV_DAYS, --num-days=NUM_PREV_DAYS
>                         Number of days to examine
> {code}
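
As an illustration of the options listed above, the following Python sketch builds an equivalent option parser with the standard `optparse` module and parses an example command line. This is a guess at the shape of the interface, not the script's actual source: the `dest` names are inferred from the metavars in the help text, treating `--num-days` as an integer is an assumption, and the argument values are examples taken from the job examined in this comment.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser mirroring the three options in the usage text."""
    parser = OptionParser(usage="%prog [options]",
                          prog="determine-flaky-tests-hadoop.py")
    parser.add_option("-J", "--jenkins-url", dest="jenkins_url",
                      metavar="JENKINS_URL", help="Jenkins URL")
    parser.add_option("-j", "--job-name", dest="job_name",
                      metavar="JOB_NAME", help="Job name to look at")
    # type="int" is an assumption; the help text does not state the type.
    parser.add_option("-n", "--num-days", dest="num_prev_days",
                      metavar="NUM_PREV_DAYS", type="int",
                      help="Number of days to examine")
    return parser


# Example arguments: examine the last 3 days of PreCommit-HDFS-Build.
example_argv = ["-J", "https://builds.apache.org",
                "-j", "PreCommit-HDFS-Build",
                "-n", "3"]
opts, args = build_parser().parse_args(example_argv)
print(opts.jenkins_url, opts.job_name, opts.num_prev_days)
```

The same three values passed on the command line of the real script would correspond to the run reported in this comment (20 runs examined over the past 3 days).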
