spark-dev mailing list archives

From shane knapp ☠ <skn...@berkeley.edu>
Subject Re: [build system] IMPORTANT UPDATE
Date Wed, 25 Nov 2020 21:35:46 GMT
hey all, work is going quite well and smoothly for this project.

today's update:

we will experience significant downtime monday/tuesday as we spin up the
new primary jenkins node.  until then, we'll be building over the next few
days so i'll have a chance to better track down and fix any system-level
build breaks.

but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to
the pool:  research-jenkins-worker-03, 04 and 06.  -05 is being difficult,
so i'm going to let it pout in the corner for a while before hitting it
again w/the ansible cannon.

shane

On Tue, Nov 24, 2020 at 6:08 PM shane knapp ☠ <sknapp@berkeley.edu> wrote:

> all spark builds have been ported and triggered:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
> not shown are the regular and k8s PRB, which are also running.
>
> i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but
> i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye on
> the build history of research-jenkins-worker-01 and -02, as they're running
> the latest OS + ansible (which will be moved into the spark repo asap).
>
> i'm still concerned about sbt failures, which include the PRB.  we'll see
> how things go, and just focus on getting things working on ubuntu 20 LTS.
> if we need to drop the ubuntu 16 workers from the pool temporarily, i would
> be more than happy to do that.  we'll lose some capacity, but it looks like
> we have a solid template for getting these suckers redeployed so
> turn-around should be pretty quick.
>
> we also need to dedicate some time to clean up/fix our plugin configs.
> there's been a lot of change over the past three years and things like PRB
> triggers seem flaky (it took 28m instead of 5m for this job to trigger:
> https://github.com/apache/spark/pull/29994)
>
> this all being said, i'm really happy w/our progress so far and have
> started leaning towards 'cautiously optimistic'...  we'll see how things go
> and recalibrate accordingly.  i'll have a better idea of where we are
> tomorrow and keep the list updated.
>
> and finally:  a HUGE thanks goes out to jon for the work going on at the
> colo this moment:  rack rearrangement, cleaning up networking, fixing
> hardware, reimaging and generally kicking ass!
>
> have a great holiday!
>
> shane
>
> On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠ <sknapp@berkeley.edu> wrote:
>
>> our very first ubuntu-based PRB is running:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>>
>> crossing my fingers!  :)
>>
>> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠ <sknapp@berkeley.edu>
>> wrote:
>>
>>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>>> all of the centos workers are being wiped NOW.
>>>
>>> this is great, as the sooner we can get started on fixing builds the
>>> better.  i'm not going anywhere over the holiday, so i'll get a good
>>> head-start on things.
>>>
>>> thank you jon!
>>>
>>> shane
>>>
>>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ <sknapp@berkeley.edu>
>>> wrote:
>>>
>>>> this is a lengthy, but important read for everyone here.
>>>>
>>>> in the next few days, the remaining centos machines (PRB/SBT workers
>>>> AND primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.
>>>>
>>>> this means three important things on the very near horizon:
>>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>>> installation to its new home.
>>>> 3 -- those of you with accounts here will temporarily lose access
>>>>
>>>> regarding (1), brian (cced) will be helping me debug and fix any
>>>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>>>> doing the reimaging and cobbling together of hardware to keep us on our
>>>> feet.  their help is going to be invaluable to getting us back on the
>>>> ground.
>>>>
>>>> we already have two ubuntu 20 workers up and building
>>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>>>> is already green.  i'll keep an eye on these workers to ensure i didn't
>>>> miss anything.
>>>>
>>>> once we have a couple more ubuntu 20 machines up, i'll move the PRB
>>>> and SBT builds there and let them fail as often as possible so we can use
>>>> the build logs during the migration of the primary.
>>>>
>>>> then we shut down jenkins and move to the new primary.
>>>>
>>>> this will all be happening in the next week to week-and-a-half.
>>>>
>>>> nearish on the horizon, we need to do three things:
>>>> 1 -- reimage the ubuntu 16 workers
>>>> 2 -- clean up all of the breakages within the jenkins plugin universe.
>>>> there are a lot of stacktraces everywhere after the upgrade, but things are
>>>> still building so i'm inclined to push this out.
>>>> 3 -- fix the PRB/SBT builds.
>>>>
>>>> further off, once we're stable, we (the spark community) will need to
>>>> have an honest conversation about where the build system lives.  we don't
>>>> currently have enough resources here to manage the system in a way that it
>>>> deserves, and i can't foresee getting the staffing for long-term support any
>>>> time soon.
>>>>
>>>> however, with the ansible configs (which i plan on moving to the spark
>>>> repo), it should be much easier to replicate the build system.
>>>>
>>>> by this time next year, i would like to have helped find the build
>>>> system a new home, and sunset jenkins.  over the past 11 years (i think),
>>>> this system has built spark.  it's getting a little tired and needs a well
>>>> deserved break.  :)
>>>>
>>>> shane
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
