airflow-dev mailing list archives

From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: Faster builds on CI + increased stability + easier to reproduce CI problems
Date Mon, 24 Aug 2020 15:41:14 GMT
Great work! Investments in CI pay dividends to the whole community.

On Sat, Aug 22, 2020 at 8:12 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
wrote:

> Hello everyone,
>
> Just wanted to let you know that last week we merged quite an overhaul of
> the CI architecture we have in GitHub Actions.
>
> TL;DR: It should be faster, more stable, and it should be super-easy to
> reproduce any CI failure locally.
>
> We should now have CI builds that are quite a bit faster, much more stable
> and, as a side effect, easy to diagnose. There are a few PRs left to merge,
> solving some teething problems and adding some optimizations, and we might
> need to implement one workaround for a missing GitHub API, but it looks
> pretty good after a few days of watching.
>
> The gist of the change is that we can now use the new "workflow_run"
> feature of GitHub Actions, which allows us to build each image only once
> and reuse it for all the jobs. Previously those images were built (using
> the latest sources) for every single job.
>
> Some stats for average runs (the gains are much bigger when a new
> patch-level version of Python has just been released):
>
>    - Prepare image job: 5 minutes 30 seconds -> 1 minute 7 seconds (~80%
>    improvement)
>    - Longest job time: 34 minutes -> 29 minutes 30 seconds (~13%
>    improvement in longest job)
>    - Build time saved per build: 27 jobs * 4.5 minutes ~ 2h of machine
>    build time saved for each build (!)
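
The figures in the list above can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are made up for illustration, and the inputs are taken directly from the quoted numbers. Note that the longest-job reduction computes to roughly 13%.

```python
# Figures quoted in the list above, converted to seconds or minutes.
prepare_before_s = 5 * 60 + 30   # 5 min 30 s
prepare_after_s = 1 * 60 + 7     # 1 min 7 s
longest_before_min = 34.0
longest_after_min = 29.5
jobs_per_build = 27
saved_per_job_min = 4.5


def improvement_pct(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the original duration."""
    return (before - after) / before * 100


prepare_gain = improvement_pct(prepare_before_s, prepare_after_s)
longest_gain = improvement_pct(longest_before_min, longest_after_min)
machine_hours_saved = jobs_per_build * saved_per_job_min / 60

print(f"Prepare image job: {prepare_gain:.0f}% faster")
print(f"Longest job: {longest_gain:.0f}% faster")
print(f"Machine time saved per build: {machine_hours_saved:.1f} h")
```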
>
> This change should also improve overall stability. There were a number of
> problems where building the image failed; this should now be ~10x less
> likely to happen, as we build images only 3 times instead of ~30.
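
One way to see where the "~10x" comes from is a simple probability model. The sketch below assumes every image build fails independently with the same probability; that is a simplification, and the per-build failure rate used here is hypothetical because the message does not give one. For small failure rates the ratio approaches 30 / 3 = 10.

```python
def any_build_fails(p_single: float, n_builds: int) -> float:
    """Probability that at least one of n independent image builds fails."""
    return 1 - (1 - p_single) ** n_builds


# Hypothetical per-build failure rate; the message does not state one.
p = 0.01

before = any_build_fails(p, 30)  # image built in ~30 jobs
after = any_build_fails(p, 3)    # image built only 3 times
ratio = before / after
print(f"before={before:.3f} after={after:.3f} ratio={ratio:.1f}x")
```

With a larger assumed failure rate the ratio drops below 10, so "~10x" is best read as an upper bound that holds when individual failures are rare.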
>
> As a result we are better citizens, and it also means we should have far
> less queuing time when several PRs start in quick succession.
>
> Also, as a side effect but an important one, we now have a super-easy way
> to reproduce any failure in CI. This is the final setup I had in mind
> when I implemented Breeze. Now anyone can just log in to the GitHub
> registry and run this:
>
> `breeze --github-image-id <RUN_ID> --backend <BACKEND> --python <X.Y>`
>
> Then you should be dropped into the EXACT same environment that was used
> for a particular failed "run" in GitHub Actions, including the Airflow
> sources used for that run. You do not have to check out the code, etc.
>
> This means that you (or anyone else trying to help) should be able to
> re-run most of the failed tests locally and reproduce the failures (and try
> to fix them).
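
For anyone scripting this, the command quoted above can be assembled from its three parameters. This is a minimal sketch: the helper name is hypothetical, only the flags shown in the message are used, and the run ID, backend and Python version below are placeholders to be replaced with the values of the failed run.

```python
import shlex


def breeze_repro_command(run_id: str, backend: str, python_version: str) -> str:
    """Build the reproduction command quoted above, shell-quoting each value."""
    args = [
        "breeze",
        "--github-image-id", run_id,
        "--backend", backend,
        "--python", python_version,
    ]
    return shlex.join(args)


# Placeholder values for illustration only; substitute those of the failed run.
command = breeze_repro_command("123456789", "postgres", "3.7")
print(command)
```

`shlex.join` requires Python 3.8 or newer; it keeps the values safe to paste into a shell even if they contain unusual characters.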
>
> Documentation with all the details and the commands you can use is coming in
> https://github.com/apache/airflow/pull/10380 - happy to get some reviews.
>
> J.
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
