spark-dev mailing list archives

From Xin Lu <...@salesforce.com.INVALID>
Subject Re: Raise Jenkins test timeout? with alternatives
Date Thu, 11 Apr 2019 18:18:58 GMT
Yes, it is worth trying running just one Jenkins job on each machine, but I
remember that at Databricks we did run just one job per machine and the
Spark tests still took hours.  We used quite large EC2 instances, too.  Now,
two years later, the number of tests has probably increased.

Xin

On Thu, Apr 11, 2019 at 11:15 AM Sean Owen <srowen@gmail.com> wrote:

> If the machines are bottlenecked on I/O or are swapping, doing less work
> concurrently would improve throughput, and parallelizing wouldn't. I don't
> know that it's the case, but am wondering out loud as the runtimes seem to
> vary by 20-30% sometimes. Naturally, having the option to parallelize is
> good as well, if those bottlenecks aren't actually a problem or are
> resolved otherwise.
>
> On Thu, Apr 11, 2019 at 1:10 PM Xin Lu <xlu@salesforce.com> wrote:
>
>> I worked on parallelizing the tests two years ago.  It does require an
>> update to the AMPLab Jenkins, which is very old (1.651.3, released
>> 2016-07-01).  The current version of CloudBees Jenkins has stages, and it
>> is not difficult to put tests in parallel stages and aggregate the test
>> results.  Reducing concurrent builds per machine would not address the
>> sheer length of the tests running serially, or the number of PRs.
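
As a rough illustration of the parallel-stages idea described above, a
declarative Jenkinsfile might look like the sketch below. This is a
hypothetical config fragment, not the actual Spark build: the module names,
sbt commands, and JUnit report path are all assumptions.

```groovy
// Hypothetical sketch only -- not the real Spark Jenkinsfile.
pipeline {
    agent any
    stages {
        stage('Tests') {
            // Run independent test modules as parallel stages.
            parallel {
                stage('core') {
                    steps { sh './build/sbt core/test' }  // assumed command
                }
                stage('sql') {
                    steps { sh './build/sbt sql/test' }   // assumed command
                }
            }
        }
    }
    post {
        always {
            // Aggregate the per-stage JUnit results; report path assumed.
            junit '**/target/test-reports/*.xml'
        }
    }
}
```

The `parallel` block and `junit` step are standard declarative-pipeline
features in modern Jenkins, which is why the upgrade from 1.651.3 would be a
prerequisite.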
>>
>> Xin
>>
>> On Thu, Apr 11, 2019 at 10:53 AM Sean Owen <srowen@gmail.com> wrote:
>>
>>> Agree, and I can make a few of the ML regression tests faster pretty
>>> easily. Here the issue is more about what happens when you run every single
>>> test, and man that does take a long time. Maybe rare enough to not justify
>>> upping the build timeout. (The PR passed just barely this time anyway)
>>>
>>> Q for Shane: we have a ton of build slots, but it seems like worker
>>> performance does slow down when there are multiple builds in progress. Is
>>> there any value in reducing the number of concurrent builds per machine,
>>> especially if we're not really using all of them? It might help load
>>> balancing or something. I was also trying to figure out whether they were
>>> swapping but couldn't find an easy way to tell.
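
One quick way to tell whether a worker is swapping, sketched below under the
assumption of a Linux box with /proc mounted (the specific fields are
standard kernel counters, but how high is "too high" is a judgment call):

```shell
# Total vs. free swap from the kernel's own accounting.
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Cumulative pages swapped in/out since boot; sample this twice while
# builds are running -- if the numbers climb between samples, the
# worker really is actively swapping, not just holding stale pages.
grep -E '^pswpin|^pswpout' /proc/vmstat
```

Sustained growth in `pswpin`/`pswpout` during builds would support the
theory that concurrent builds are pushing the workers into swap.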
>>>
>>> On Thu, Apr 11, 2019 at 11:00 AM Xiao Li <lixiao@databricks.com> wrote:
>>>
>>>> Hi, Sean
>>>>
>>>> Your issue actually shows that our existing test framework needs a
>>>> change ASAP. We need to go over the tests listed at
>>>> https://spark-tests.appspot.com/slow-tests and see whether we can
>>>> reduce their runtime or run these test suites in parallel.
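
For the second option, a crude way to fan suites out across processes from
the shell is sketched below. This is purely illustrative: the module names
are assumptions, and the commands are only echoed rather than run.

```shell
# Dry-run sketch: print the per-module test commands that xargs would
# launch, up to 4 at a time. Dropping the "echo" would actually run
# them (assuming these sbt module names exist in the build).
printf '%s\n' core sql mllib streaming |
  xargs -P 4 -I{} echo build/sbt {}/test
```

The `-P` flag caps the number of concurrent child processes, so the same
one-liner can be tuned to however many cores a worker has free.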
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>>
>>>> On Thu, Apr 11, 2019 at 4:26 AM Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>>> I have a big PR that keeps failing because it hits the 300-minute
>>>>> build timeout:
>>>>>
>>>>> https://github.com/apache/spark/pull/24314
>>>>>
>>>>> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4703/console
>>>>>
>>>>> It's because it touches so much code that all tests run, including
>>>>> things like Kinesis. It looks like 300 minutes isn't enough. We can
>>>>> raise it to an eye-watering 360 minutes if that's just how long all
>>>>> the tests take.
>>>>>
>>>>> I can also try splitting up the change to move out changes to a few
>>>>> optional modules into separate PRs.
>>>>>
>>>>> (Because this one makes it all the way through Python and Java tests
>>>>> and almost all R tests several times, and doesn't touch Python or R
>>>>> and shouldn't have any functional changes, I'm tempted to just merge
>>>>> it, too, as a solution)
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>
