hadoop-yarn-dev mailing list archives

From Ahmed Hussein...@ahussein.me>
Subject Re: Fixing flaky tests in Apache Hadoop
Date Fri, 23 Oct 2020 20:58:57 GMT
>
> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> day two years ago, and maybe it's time to repeat it again:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
> This is going to be tricky as we are in a pandemic and most of the community
> are working from home, unlike the last time when we can lock ourselves in a
> conference room and force everybody to work :)


How about the following idea:

We set a monthly window during which only unit test fixes can be merged.
Any other commit that is not directly linked to JUnit test failures would be
blocked until the end of this "Bug-Window".
For example, we set "Bug-days" to run from the 25th to the 31st of each month.
All commits during those days are meant to fix and improve the testing
environment.
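
A minimal sketch of how such a gate could be expressed, for instance in a
pre-merge check. The function names and the `fixes_unit_test` flag are
hypothetical; how a commit gets classified as a test fix (JIRA label, commit
message, etc.) is left open here.

```python
from datetime import date

# Window bounds taken from the example above (25th to 31st of each month).
# In months shorter than 31 days the window simply ends with the month.
BUG_WINDOW_START_DAY = 25
BUG_WINDOW_END_DAY = 31


def in_bug_window(day: date) -> bool:
    """Return True if `day` falls inside the monthly test-fix window."""
    return BUG_WINDOW_START_DAY <= day.day <= BUG_WINDOW_END_DAY


def may_merge(day: date, fixes_unit_test: bool) -> bool:
    """Outside the window anything may merge; inside it, only test fixes.

    `fixes_unit_test` is an assumed flag supplied by whoever classifies
    the commit; this sketch does not decide that.
    """
    return fixes_unit_test or not in_bug_window(day)
```

With these bounds, a feature commit on the 26th would be held back until the
1st of the following month, while a test fix could go in on any day.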

Any thoughts?

On Thu, Oct 22, 2020 at 11:53 PM Ahmed Hussein <a@ahussein.me> wrote:

> Thank you Akira and Wei-Chiu.
> IMHO, the citation is more than just flaky tests. It has more depth:
> - Every developer stays committed to keeping the code healthy.
> - Those flaky tests are actually "*bugs*" that need to be fixed. It is
>   evident that there is a major problem in handling the resources, as I
>   will explain below.
>
>> 1. Other projects such as HBase have a tool to exclude flaky tests from
>> being executed. They track flaky tests and display them in a dashboard.
>> This will allow good tests to pass while leaving time for folks to fix
>> them. Or we could manually exclude tests (this is what we used to do at
>> Cloudera)
>>
>
> I like the idea of having a tool that gives a view of broken tests.
>
> I spent a long time converting HDFS flaky tests into sub-tasks under
> HDFS-15646 <https://issues.apache.org/jira/browse/HDFS-15646>. I
> believe there are still tons on the loose.
> I remember I explored a tool called DeFlaker
> <https://www.jonbell.net/icse18-deflaker.pdf> which detects flaky tests.
> Then it reruns the tests to verify that they still pass.
>
> I do not think we necessarily want to exclude the flaky tests, but at
> least they should be enumerated and addressed regularly because they are,
> after all, "bugs". Having a few flaky tests that cause everything to blow
> up indicates that there is a major problem with handling resources.
> I pointed out this issue in YARN-10334
> <https://issues.apache.org/jira/browse/YARN-10334> where I found that
> TestDistributedShell is nothing but a black hole that sucks up all the
> resources (memory/CPU/ports).
> As another example, I ran a few unit tests on my local machine. In less
> than an hour, I found that there were 6 Java processes still listening on
> ports.
>
> The point is that flaky tests should not be *underestimated* for such a
> long time, as they could be indicators of a serious bug.
> In the current situation, we should find what is eating all those
> resources.
>
>> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
>> day two years ago, and maybe it's time to repeat it again:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>> This is going to be tricky as we are in a pandemic and most of the community
>> are working from home, unlike the last time when we can lock ourselves in a
>> conference room and force everybody to work :)
>>
> This sounds fun and I actually like it, but I doubt it is feasible to
> apply :)
>
>> I also wondered if the hardware was too stressed since all Hadoop related
>> projects all use the same set of Jenkins servers.
>> However, HBase just recently moved to their own dedicated machines, so I'm
>> actually surprised to see a lot of resource related failures even now.
>>
> As I mentioned in my response on the first point, a black hole is created
> once the tests are triggered.
> I could not even run TestDistributedShell on my local machine. The tests
> run out of everything after the first 11 unit tests.
> It takes only 1 failing unit test to break the rest.
>
> On Thu, Oct 22, 2020 at 5:28 PM Wei-Chiu Chuang <weichiu@apache.org>
> wrote:
>
>> I also wondered if the hardware was too stressed since all Hadoop related
>> projects all use the same set of Jenkins servers.
>> However, HBase just recently moved to their own dedicated machines, so I'm
>> actually surprised to see a lot of resource related failures even now.
>>
>> On Thu, Oct 22, 2020 at 2:03 PM Wei-Chiu Chuang <weichiu@apache.org>
>> wrote:
>>
>> > Thanks for raising the issue, Akira and Ahmed,
>> >
>> > Fixing flaky tests is a thankless job so I want to take this opportunity
>> > to recognize the time and effort.
>> >
>> > We will always have flaky tests due to bad tests or simply infra issues.
>> > Fixing flaky tests will take time but if they are not addressed it wastes
>> > everybody's time.
>> >
>> > Recognizing this problem, I have two suggestions:
>> >
>> > 1. Other projects such as HBase have a tool to exclude flaky tests from
>> > being executed. They track flaky tests and display them in a dashboard.
>> > This will allow good tests to pass while leaving time for folks to fix
>> > them. Or we could manually exclude tests (this is what we used to do at
>> > Cloudera)
>> >
>> > 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
>> > day two years ago, and maybe it's time to repeat it again:
>> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>> > This is going to be tricky as we are in a pandemic and most of the community
>> > are working from home, unlike the last time when we can lock ourselves in a
>> > conference room and force everybody to work :)
>> >
>> > Thoughts?
>> >
>> >
>> > On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aajisaka@apache.org>
>> > wrote:
>> >
>> >> Hi Hadoop developers,
>> >>
>> >> Now there are a lot of failing unit tests and there is an issue to
>> >> tackle this bad situation.
>> >> https://issues.apache.org/jira/browse/HDFS-15646
>> >>
>> >> Although this issue is in HDFS project, this issue is related to all
>> >> the Hadoop developers. Please check the above URL, read the
>> >> description, and volunteer to dedicate more time to fix flaky tests.
>> >> Your contribution to fixing the flaky tests will be really
>> >> appreciated!
>> >>
>> >> Thank you Ahmed Hussein for your report.
>> >>
>> >> Regards,
>> >> Akira
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
>> >> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>> >>
>> >>
>>
>
>
> --
> Best Regards,
>
> *Ahmed Hussein, PhD*
>


-- 
Best Regards,

*Ahmed Hussein, PhD*
