spark-dev mailing list archives

From Xiao Li <lix...@databricks.com>
Subject Re: restarting jenkins build system tomorrow (7/8) ~930am PDT
Date Mon, 13 Jul 2020 17:17:53 GMT
Thank you very much, Shane!

Xiao

On Mon, Jul 13, 2020 at 10:15 AM shane knapp ☠ <sknapp@berkeley.edu> wrote:

> alright, the system load graphs show that we've had a generally decreasing
> load since friday, and have burned through ~3k builds/day since the reboot
> last week!  i don't see many timeouts, and the PRB builds have been
> generally green for a couple of days.
>
> again, i will keep an eye on things but i feel we're out of the woods
> right now.  :)
>
> shane
>
> On Fri, Jul 10, 2020 at 3:43 PM Frank Yin <ukby.1234@gmail.com> wrote:
>
>> Great. Thanks.
>>
>> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠ <sknapp@berkeley.edu>
>> wrote:
>>
>>> no, 8 hours is plenty.  things will speed up soon once the backlog of
>>> builds works through....  i limited the number of PRB builds to 4 per
>>> worker, and things are looking better.  let's see how we look next week.
>>>
>>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin <ukby.1234@gmail.com> wrote:
>>>
>>>> Can we also increase the build timeout?
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
>>>> This one fails because it times out, not because of test failures.
>>>>
>>>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin <ukby.1234@gmail.com> wrote:
>>>>
>>>>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>>>>
>>>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <sknapp@berkeley.edu>
>>>>> wrote:
>>>>>
>>>>>> only 125561, 125562 and 125564 were impacted by -9.
>>>>>>
>>>>>> 125565 exited w/a code of 15 (143 - 128), which means the process was
>>>>>> terminated for unknown reasons.
>>>>>>
>>>>>> 125563 looks like mima failed due to a bunch of errors.
>>>>>>
>>>>>> i just spot checked a bunch of recent failed PRB builds from today
>>>>>> and they all seemed to be legit.
>>>>>>
>>>>>> another thing that might be happening is an overload of PRB builds on
>>>>>> the workers due to the backlog...  the workers are under a LOT of load
>>>>>> right now, and i can put some rate limiting in to see if that helps out.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <ukby.1234@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Like from build number 125565 to 125561, all impacted by kill -9.
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <sknapp@berkeley.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> define "a lot" and provide some links to those builds, please.
>>>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a
>>>>>>>> cursory eye on things.
>>>>>>>>
>>>>>>>> the infrastructure that the tests run on hasn't changed one bit on
>>>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused by
>>>>>>>> old build processes remaining on the workers after the master went down, or
>>>>>>>> me trying to clean things up w/o a reboot.  or, perhaps, something wrong
>>>>>>>> w/the infra.  :)
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <ukby.1234@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Agree, but I’ve seen a lot of kill by signal 9, assuming that's
>>>>>>>>> infrastructure?
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <sknapp@berkeley.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> yeah, i can't do much for flaky tests...  just flaky
>>>>>>>>>> infrastructure.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <
>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Couple of flaky tests can happen. It's usual. Seems it got
>>>>>>>>>>> better now at least. I will keep monitoring the builds.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 <ukby.1234@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looks like Jenkins isn't stable still. My PR fails two times in
>>>>>>>>>>>> a row:
>>>>>>>>>>>>
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>>>>>>
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sent from:
>>>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Shane Knapp
>>>>>>>>>> Computer Guy / Voice of Reason
>>>>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Shane Knapp
>>>>>>>> Computer Guy / Voice of Reason
>>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Shane Knapp
>>>>>> Computer Guy / Voice of Reason
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>>
>>>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
<https://databricks.com/sparkaisummit/north-america>
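
A note on the exit codes discussed in the thread: the "143 - 128" arithmetic follows the usual POSIX shell convention that a status above 128 means the process was ended by signal number (status - 128). Under that convention 143 maps to signal 15 and 137 would map to signal 9, the 'kill -9' case. Below is a minimal sketch of that decoding, not anything taken from the Jenkins setup itself. The function name describe_exit_status is hypothetical, and the signal names assume a POSIX host.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper, POSIX assumed)."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No named signal with this number on the current platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 143 is the status quoted in the thread; 137 is what 'kill -9' would give.
    for status in (143, 137, 1, 0):
        print(status, "->", describe_exit_status(status))
```

The signal number only says how the process ended, not who sent the signal, so whether a timeout, a cleanup, or something else killed a given build still has to be worked out from the build logs.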
