spark-dev mailing list archives

From Tianhua huang <>
Subject Re: Ask for ARM CI for spark
Date Thu, 19 Sep 2019 02:59:20 GMT
@Dongjoon Hyun <> ,

Sure, and I have already updated the JIRA :)
If anything is missing, please let me know. Thank you.

On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <> wrote:

> Hi, Tianhua.
> Could you summarize the details on the JIRA once more?
> It will be very helpful for the community. Also, I've been waiting on that
> JIRA. :)
> Bests,
> Dongjoon.
> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <>
> wrote:
>> @shane knapp <> thank you very much, I opened an
>> issue for this, we can
>> talk about the details in it :)
>> And we will prepare an arm instance today and will send the info to your
>> email later.
>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <> wrote:
>>> @Tianhua huang <> sure, i think we can get
>>> something sorted for the short-term.
>>> all we need is ssh access (i can provide an ssh key), and i can then
>>> have our jenkins master launch a remote worker on that instance.
>>> instance setup, etc, will be up to you.  my support for the time being
>>> will be to create the job and 'best effort' for everything else.
>>> this should get us up and running asap.
>>> is there an open JIRA for jenkins/arm test support?  we can move the
>>> technical details about this idea there.
>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <>
>>> wrote:
>>>> @Sean Owen <> , so sorry for the late reply, we had the
>>>> Mid-Autumn holiday :)
>>>> If you would like to integrate ARM CI into amplab jenkins, we can offer the
>>>> arm instance, and then the ARM job will run together with the other x86
>>>> jobs. So maybe there is a guideline for doing this? @shane knapp
>>>> <>  would you help us?
>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <> wrote:
>>>>> I don't know what's involved in actually accepting or operating those
>>>>> machines, so can't comment there, but in the meantime it's good that you
>>>>> are running these tests and can help report changes needed to keep it
>>>>> working with ARM. I would continue with that for now.
>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <
>>>>>> wrote:
>>>>>> Hi all,
>>>>>> For the whole work process of spark ARM CI, we want to make 2 things
>>>>>> clear.
>>>>>> The first thing is:
>>>>>> About spark ARM CI, we now have two periodic jobs: one job[1] is based
>>>>>> on commit[2] (which already fixed the failing replay tests issue[3]; we
>>>>>> made a new test branch based on date 09-09-2019), and the other job[4] is
>>>>>> based on spark master.
>>>>>> With the first job, we test on the specified branch to prove that our ARM
>>>>>> CI is good and stable.
>>>>>> The second job checks spark master every day, so we can find out
>>>>>> whether the latest commits affect the ARM CI. The build history and
>>>>>> results show that some problems are easier to find on ARM,
>>>>>> like SPARK-28770,
>>>>>> and it also shows that we would make efforts to trace and figure
>>>>>> out, till now we have found and fixed several problems[5][6][7],
>>>>>> everyone of the community :). And we believe that ARM CI is very
>>>>>> right?
>>>>>> The second thing is:
>>>>>> We plan to run the jobs for a period of time, and you can see the
>>>>>> results and logs in the 'build history' of the jobs console. If everything
>>>>>> goes well for one or two weeks, could the community accept the ARM CI?
>>>>>> Or how long should the periodic jobs run before the community has enough
>>>>>> confidence to accept the ARM CI? As you suggested before, it's good to
>>>>>> integrate ARM CI into amplab jenkins. We agree with that, and we can donate
>>>>>> the ARM instances and then maintain the ARM-related test jobs together with
>>>>>> the community. Any thoughts?
>>>>>> Thank you all!
>>>>>> [1]
>>>>>> [2]
>>>>>> [3]
>>>>>> [4]
>>>>>> [5]
>>>>>> [6]
>>>>>> [7]
>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <> wrote:
>>>>>>> Yes, I think it's just local caching. After you run the build you
>>>>>>> should find lots of stuff cached at ~/.m2/repository and it won't
>>>>>>> download every time.
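The local caching described in this reply can be checked directly on a worker. The following is only a minimal sketch in Python, assuming the default local repository path ~/.m2/repository named above. The function name `cached_jars` is hypothetical and used purely for illustration.

```python
from pathlib import Path


def cached_jars(repo_dir):
    """Return the sorted list of .jar files found under repo_dir.

    repo_dir is assumed to be a Maven local repository such as
    ~/.m2/repository; an empty list is returned if it does not exist.
    """
    root = Path(repo_dir).expanduser()
    if not root.is_dir():
        return []
    # Walk the whole tree; Maven stores artifacts in nested
    # group/artifact/version directories.
    return sorted(p for p in root.rglob("*.jar") if p.is_file())


if __name__ == "__main__":
    jars = cached_jars("~/.m2/repository")
    print(f"{len(jars)} cached jar(s) found under ~/.m2/repository")
```

If the list is non-empty before a build starts, Maven can resolve those artifacts locally, which would be consistent with a job log that shows no download lines.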
>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <
>>>>>>>> wrote:
>>>>>>>> Hi Sean,
>>>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>>>> I know the dependencies will be downloaded by SBT or Maven. But
>>>>>>>> the Spark QA job also executes "mvn clean package", so why didn't the log
>>>>>>>> print "downloading some jar from Maven Central" [1], and why was the build
>>>>>>>> so fast? Is the reason that Spark Jenkins builds the Spark jars on physical
>>>>>>>> machines and won't destroy the test env after a job is finished? Then other
>>>>>>>> jobs that build Spark will get the dependency jars from the local cache,
>>>>>>>> because previous jobs executed "mvn package" and those dependencies had
>>>>>>>> already been downloaded to the local worker machine. Am I right? Is that
>>>>>>>> the reason the job log[1] didn't print any download information from Maven
>>>>>>>> Central?
>>>>>>>> Thank you very much.
>>>>>>>> [1]
>>>>>>>> Best regards
>>>>>>>> ZhaoBo
>>>>>>>> On Fri, Aug 16, 2019, Sean Owen <> wrote:
>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>>>>> and Maven like in any other project, and nothing about it is specific to
>>>>>>>>> Spark.
>>>>>>>>> The worker machines cache artifacts that are downloaded by
>>>>>>>>> these, but this is a function of Maven and SBT, not Spark. You may find
>>>>>>>>> that the initial download takes a long time.
>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <
>>>>>>>>>> wrote:
>>>>>>>>>> Hi Sean,
>>>>>>>>>> Thanks very much for pointing out the roadmap. ;-) Then I think
>>>>>>>>>> we will continue to focus on our test environment.
>>>>>>>>>> As for the networking problems, I mean that we can access Maven
>>>>>>>>>> Central, and jobs could download the required jar packages at a high
>>>>>>>>>> network speed. What we want to know is why the Spark QA test jobs[1]
>>>>>>>>>> log shows that the job script/maven build doesn't seem to download the
>>>>>>>>>> jar packages. Could you tell us the reason for that? Thank you.
>>>>>>>>>> The reason we raised the "networking problems" is something we observed
>>>>>>>>>> during our tests: if we execute "mvn clean package" in a new test
>>>>>>>>>> environment (in our test environment, we destroy the test VMs after the
>>>>>>>>>> job is finished), maven will download the dependency jar packages from
>>>>>>>>>> Maven Central, but in the job "spark-master-test-maven-hadoop" [2], we
>>>>>>>>>> didn't find any jar package downloads in the log. What is the reason
>>>>>>>>>> for that?
>>>>>>>>>> Also, when we build the Spark jar while downloading dependencies from
>>>>>>>>>> Maven Central, it takes about 1 hour, whereas we found [2] took just
>>>>>>>>>> 10min. But if we run "mvn package" in a VM which has already executed
>>>>>>>>>> "mvn package" before, it takes just 14min, which is very close to
>>>>>>>>>> [2]. So we suspect that downloading the jar packages costs so much
>>>>>>>>>> time. For the goal of ARM CI, we expect the performance of the NEW ARM CI
>>>>>>>>>> to be close to that of the existing X86 CI, so that users can accept it
>>>>>>>>>> more easily.
>>>>>>>>>> [1]
>>>>>>>>>> [2]
>>>>>>>>>> Best regards
>>>>>>>>>> ZhaoBo
>>>>>>>>>> On Thu, Aug 15, 2019, Sean Owen <> wrote:
>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If
>>>>>>>>>>> we set up CI/CD it will only tell us there are still some test
>>>>>>>>>>> failures. If it's stable, and not hard to add to the existing CI/CD,
>>>>>>>>>>> yes it could be done automatically later. You can continue to test on
>>>>>>>>>>> ARM independently for now.
>>>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>>>> test system if you're not able to download from Maven Central. That
>>>>>>>>>>> rarely takes significant time, and there aren't project-specific
>>>>>>>>>>> mirrors here. You might be able to point at a closer public mirror,
>>>>>>>>>>> depending on where you are.
>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> I want to discuss spark ARM CI again. We ran some tests on an arm
>>>>>>>>>>>> instance based on master and the job includes  and k8s
>>>>>>>>>>>> integration
>>>>>>>>>>>> there are several things I want to talk about:
>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>>     1. We have fixed some problems like
>>>>>>>>>>>> thanks to sean owen and others for helping us.
>>>>>>>>>>>>     2. We tried the k8s integration test on arm and hit an error:
>>>>>>>>>>>> apk fetch hangs. The tests passed after adding the '--network host'
>>>>>>>>>>>> option to the `docker build` command, see:
>>>>>>>>>>>> , the solution refers to
>>>>>>>>>>>>  and I
>>>>>>>>>>>> don't know whether this has ever happened in the community CI, or maybe
>>>>>>>>>>>> we should submit a pr to pass '--network host' to `docker build`?
>>>>>>>>>>>>     3. We found that two tests fail after the commit
>>>>>>>>>>>>        ReplayListenerSuite:
>>>>>>>>>>>>        - ...
>>>>>>>>>>>>        - End-to-end replay *** FAILED ***
>>>>>>>>>>>>          "[driver]" did not equal "[1]"
>>>>>>>>>>>> (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>        - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>          "[driver]" did not equal "[1]"
>>>>>>>>>>>> (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>         We tried reverting the commit and then the tests
>>>>>>>>>>>> passed. The patch is too big, and so sorry, we haven't been able to
>>>>>>>>>>>> find the reason so far. If you are interested, please try it; we would
>>>>>>>>>>>> really appreciate it
>>>>>>>>>>>>         if someone could help us figure it out.
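For the '--network host' workaround in item 2 above, the command line could be assembled as below. This is only a sketch in Python: the `--network host` option is the one named in the message, while the function name `docker_build_command`, the image tag, and the build context path are hypothetical placeholders, not values taken from the Spark k8s integration scripts.

```python
import shlex


def docker_build_command(context_dir, image_tag, host_network=True):
    """Build the argument list for `docker build`.

    When host_network is True, '--network host' is passed so that
    build steps (such as package fetches) use the host's network.
    """
    cmd = ["docker", "build"]
    if host_network:
        cmd += ["--network", "host"]
    cmd += ["-t", image_tag, context_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder tag and context, for illustration only.
    print(shlex.join(docker_build_command(".", "spark-arm-test:latest")))
```

Keeping the option behind a flag makes it easy to compare builds with and without host networking when checking whether the hang is specific to one environment.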
>>>>>>>>>>>> Second, about the test time: we increased the flavor of the arm
>>>>>>>>>>>> instance to 16U16G, but there seemed to be no significant improvement.
>>>>>>>>>>>> The k8s integration test took about one and a half hours, and the QA
>>>>>>>>>>>> test (like the spark-master-test-maven-hadoop-2.7 community jenkins
>>>>>>>>>>>> job) took about seventeen hours (it is too long :(). We suspect that
>>>>>>>>>>>> the reason is the performance and network.
>>>>>>>>>>>> We split the jobs based on projects such as sql, core and so
>>>>>>>>>>>> on, and the time can be decreased to about seven hours, see
>>>>>>>>>>>> We found the Spark QA tests like
>>>>>>>>>>>> it looks like all tests never seem to download the jar packages from
>>>>>>>>>>>> the maven central repo(such as
>>>>>>>>>>>> So we want to know how the jenkins jobs can do that. Is there an
>>>>>>>>>>>> internal maven repo launched? Maybe we can do the same thing to avoid
>>>>>>>>>>>> the network connection cost of downloading the dependent jar packages.
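Splitting one long test run into per-project jobs, as described above, could be sketched as follows in Python. Maven's `-pl` (`--projects`) option is used to restrict a run to one module. The function name `per_module_test_commands` and the module list are assumptions for illustration and are not the actual job definitions used in these tests.

```python
def per_module_test_commands(modules, mvn="mvn"):
    """Return one Maven test command (as an argument list) per module.

    Each command limits the run to a single module via '-pl', so the
    commands can be scheduled as separate, parallel CI jobs.
    """
    return [[mvn, "-pl", module, "test"] for module in modules]


if __name__ == "__main__":
    # Example module paths only; adjust to the projects being split.
    for cmd in per_module_test_commands(["core", "sql/core"]):
        print(" ".join(cmd))
```

A run limited with `-pl` resolves the module's sibling dependencies from the local repository, so this sketch assumes the other modules were already built and installed there, for example by an earlier `mvn install -DskipTests`.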
>>>>>>>>>>>> Third, and most important, is the ARM CI of spark itself. We
>>>>>>>>>>>> believe that it is necessary, right? And you can see we have really
>>>>>>>>>>>> made a lot of effort. Now the basic arm build/test jobs are ok, so we
>>>>>>>>>>>> suggest adding arm jobs to the community. We can set them to non-voting
>>>>>>>>>>>> at first, and improve/enrich the jobs step by step. Generally, there
>>>>>>>>>>>> are two ways in our mind to integrate the ARM CI for spark:
>>>>>>>>>>>>      1) We introduce openlab ARM CI into spark as a custom CI
>>>>>>>>>>>> system. We provide human resources and test ARM VMs, and we will also
>>>>>>>>>>>> focus on the ARM related issues about Spark. We will push the PR into
>>>>>>>>>>>> community.
>>>>>>>>>>>>      2) We donate ARM VM resources into existing
>>>>>>>>>>>> Jenkins. We still provide human resources, focus on the ARM related
>>>>>>>>>>>> issues about Spark and push the PR into community.
>>>>>>>>>>>> With both options, we will provide human resources for maintenance. Of
>>>>>>>>>>>> course it will be great if we can work together. So please tell us
>>>>>>>>>>>> which option you would prefer, and let's move forward. Waiting for your
>>>>>>>>>>>> reply, thank you very much.
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
