spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [build system] IMPORTANT UPDATE
Date Thu, 26 Nov 2020 03:26:44 GMT
Thanks Shane.

On Thu, 26 Nov 2020, 10:19 shane knapp ☠, <sknapp@berkeley.edu> wrote:

> alright, builds are looking solid except for SBT...  if someone here could
> take a look at those failures i'd be most appreciative.
>
> the important ones:  PRB, PRB-K8s, k8s, snapshot and maven builds all
> green!
>
> i'm literally gobsmacked by how smoothly this went.  :)
>
> we're all going to enjoy a mellow holiday and i'll check build statuses
> every now and then and see if i find anything else like this:
> https://issues.apache.org/jira/browse/SPARK-33565
>
> have a great holiday everyone!  we'll start getting the new primary set up
> on monday, and hopefully by tuesday be fully up and running.
>
> shane
>
>
> On Wed, Nov 25, 2020 at 1:35 PM shane knapp ☠ <sknapp@berkeley.edu> wrote:
>
>> hey all, work is going quite well and smoothly for this project.
>>
>> today's update:
>>
>> we will experience significant downtime monday/tuesday as we spin up the
>> new primary jenkins node.  until then, we'll be building over the next few
>> days so i'll have a chance to better track down and fix any system-level
>> build breaks.
>>
>> but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to
>> the pool:  research-jenkins-worker-03, 04 and 06.  -05 is being difficult,
>> so i'm going to let it pout in the corner for a while before hitting it
>> again w/the ansible cannon.
>>
>> shane
>>
>> On Tue, Nov 24, 2020 at 6:08 PM shane knapp ☠ <sknapp@berkeley.edu>
>> wrote:
>>
>>> all spark builds have been ported and triggered:
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>
>>> not shown are the regular and k8s PRB, which are also running.
>>>
>>> i think i've nailed down most of the stupid PATH and JAVA_HOME issues,
>>> but i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye
>>> on the build history of research-jenkins-worker-01 and -02, as they're
>>> running the latest OS + ansible (which will be moved into the spark repo
>>> asap).
>>>
>>> i'm still concerned about sbt failures, which include the PRB.  we'll
>>> see how things go, and just focus on getting things working on ubuntu 20
>>> LTS.  if we need to drop the ubuntu 16 workers from the pool temporarily, i
>>> would be more than happy to do that.  we'll lose some capacity, but it
>>> looks like we have a solid template for getting these suckers redeployed so
>>> turn-around should be pretty quick.
>>>
>>> we also need to dedicate some time to clean up/fix our plugin configs.
>>> there's been a lot of change over the past three years and things like PRB
>>> triggers seem flaky (it took 28m instead of 5m for this job to trigger:
>>> https://github.com/apache/spark/pull/29994)
>>>
>>> this all being said, i'm really happy w/our progress so far and have
>>> started leaning towards 'cautiously optimistic'...  we'll see how things go
>>> and recalibrate accordingly.  i'll have a better idea of where we are
>>> tomorrow and keep the list updated.
>>>
>>> and finally:  a HUGE thanks goes out to jon for the work going on at the
>>> colo this moment:  rack rearrangement, cleaning up networking, fixing
>>> hardware, reimaging and generally kicking ass!
>>>
>>> have a great holiday!
>>>
>>> shane
>>>
>>> On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠ <sknapp@berkeley.edu>
>>> wrote:
>>>
>>>> our very first ubuntu-based PRB is running:
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>>>>
>>>> crossing my fingers!  :)
>>>>
>>>> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠ <sknapp@berkeley.edu>
>>>> wrote:
>>>>
>>>>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>>>>> all of the centos workers are being wiped NOW.
>>>>>
>>>>> this is great, as the sooner we can get started on fixing builds the
>>>>> better.  i'm not going anywhere over the holiday, so i'll get a good
>>>>> head-start on things.
>>>>>
>>>>> thank you jon!
>>>>>
>>>>> shane
>>>>>
>>>>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ <sknapp@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> this is a lengthy, but important read for everyone here.
>>>>>>
>>>>>> in the next few days, the remaining centos machines (PRB/SBT workers
>>>>>> AND primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.
>>>>>>
>>>>>> this means three important things on the very near horizon:
>>>>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>>>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>>>>> installation to its new home.
>>>>>> 3 -- those of you with accounts here will temporarily lose access
>>>>>>
>>>>>> regarding (1), brian (cced) will be helping me debug and fix any
>>>>>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>>>>>> doing the reimaging and cobbling together of hardware to keep us on our
>>>>>> feet.  their help is going to be invaluable to getting us back on the
>>>>>> ground.
>>>>>>
>>>>>> we already have two ubuntu 20 workers up and building
>>>>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>>>>>> is already green.  i'll keep an eye on these workers to ensure i didn't
>>>>>> miss anything.
>>>>>>
>>>>>> once we have a couple more ubuntu 20 machines up, i'll move the
>>>>>> PRB and SBT builds there and let them fail as often as possible so we can
>>>>>> use the build logs during the migration of the primary.
>>>>>>
>>>>>> then we shut down jenkins and move to the new primary.
>>>>>>
>>>>>> this will all be happening in the next week to week-and-a-half.
>>>>>>
>>>>>> nearish on the horizon, we need to do three things:
>>>>>> 1 -- reimage the ubuntu 16 workers
>>>>>> 2 -- clean up all of the breakages within the jenkins plugin
>>>>>> universe.  there are a lot of stacktraces everywhere after the upgrade, but
>>>>>> things are still building so i'm inclined to push this out.
>>>>>> 3 -- fix the PRB/SBT builds.
>>>>>>
>>>>>> further off, once we're stable, we (the spark community) will need to
>>>>>> have an honest conversation about where the build system lives.  we don't
>>>>>> currently have enough resources here to manage the system in the way that it
>>>>>> deserves, and i can't foresee getting the staffing for long-term support any
>>>>>> time soon.
>>>>>>
>>>>>> however, with the ansible configs (which i plan on moving to the
>>>>>> spark repo), it should be much easier to replicate the build system.
>>>>>>
>>>>>> by this time next year, i would like to have helped find the build
>>>>>> system a new home, and sunset jenkins.  over the past 11 years (i think),
>>>>>> this system has built spark.  it's getting a little tired and needs a
>>>>>> well-deserved break.  :)
>>>>>>
>>>>>> shane
>>>>>> --
>>>>>> Shane Knapp
>>>>>> Computer Guy / Voice of Reason
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>>
