spark-dev mailing list archives

From shane knapp <>
Subject Re: File JIRAs for all flaky test failures
Date Wed, 15 Feb 2017 20:50:53 GMT
it's not an open-file limit -- i have the jenkins workers set up w/a soft
file limit of 100k, and a hard limit of 200k.
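
To check which limits and how much fd usage a local test run actually sees, here is a minimal sketch. It assumes Linux with /proc mounted and Python 3; the helper names `fd_limits` and `open_fd_count` are hypothetical and not part of any Spark or Jenkins tooling.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors by listing /proc/<pid>/fd.

    Linux only: returns None when the directory cannot be read.
    For pid="self" the count includes the descriptor that is
    briefly opened to read the directory itself.
    """
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"open fds in this process: {open_fd_count()}")
```

Passing the pid of a running test JVM to `open_fd_count` (as the same user) gives a rough view of how close a suite gets to the soft limit.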

On Wed, Feb 15, 2017 at 12:48 PM, Armin Braun <> wrote:

> I think one thing that is contributing to this a lot too is the general
> issue of the tests taking up a lot of file descriptors (10k+ if I run them
> on a standard Debian machine).
> There are a few suites that contribute to this in particular like
> `org.apache.spark.ExecutorAllocationManagerSuite` which, like a few
> others, appears to consume a lot of fds.
> Wouldn't it make sense to open JIRAs about those and actively try to
> reduce the resource consumption of these tests?
> Seems to me these can cause a lot of unpredictable behavior (making the
> reason for flaky tests hard to identify, especially when there are timeouts
> etc. involved) + they make it prohibitively expensive for many to test
> locally imo.
> On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <>
> wrote:
>> I was working on something to address this a while ago, but the
>> difficulty in testing locally made things a lot more complicated to fix
>> for each of the unit tests. Should we resurface this JIRA? I
>> wholeheartedly agree with the flakiness assessment of the unit tests.
>> [SPARK-9487] Use the same num. worker threads in Scala ...
>> In Python we use `local[4]` for unit tests, while in Scala/Java we use
>> `local[2]` and `local` for some unit tests in SQL, MLLib, and other
>> components. If the ...
>> ------------------------------
>> *From:* Kay Ousterhout <>
>> *Sent:* Wednesday, February 15, 2017 12:10 PM
>> *To:*
>> *Subject:* File JIRAs for all flaky test failures
>> Hi all,
>> I've noticed the Spark tests getting increasingly flaky -- it seems more
>> common than not now that the tests need to be re-run at least once on PRs
>> before they pass.  This is both annoying and problematic because it makes
>> it harder to tell when a PR is introducing new flakiness.
>> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
>> fails on a PR (for a reason unrelated to the PR).  Just provide a quick
>> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
>> "Tests failed because 250m timeout expired", a link to the failed build,
>> and include the "Tests" component.  If there's already a JIRA for the
>> issue, just comment with a link to the latest failure.  I know folks don't
>> always have time to track down why a test failed, but this is at least
>> helpful to someone else who, later on, is trying to diagnose when the issue
>> started to find the problematic code / test.
>> If this seems like too high overhead, feel free to suggest alternative
>> ways to make the tests less flaky!
>> -Kay
