spark-dev mailing list archives

From Tianhua huang <huangtianhua...@gmail.com>
Subject Re: Ask for ARM CI for spark
Date Thu, 19 Sep 2019 02:59:20 GMT
@Dongjoon Hyun <dongjoon.hyun@gmail.com> ,

Sure, and I have updated the JIRA already :)
https://issues.apache.org/jira/browse/SPARK-29106
If anything is missing, please let me know. Thank you.

On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> Hi, Tianhua.
>
> Could you summarize the details on the JIRA once more?
> It will be very helpful for the community. Also, I've been waiting on that
> JIRA. :)
>
> Bests,
> Dongjoon.
>
>
> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua223@gmail.com>
> wrote:
>
>> @shane knapp <sknapp@berkeley.edu> thank you very much, I opened an
>> issue for this https://issues.apache.org/jira/browse/SPARK-29106, we can
>> talk about the details in it :)
>> And we will prepare an arm instance today and will send the info to your
>> email later.
>>
>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <sknapp@berkeley.edu> wrote:
>>
>>> @Tianhua huang <huangtianhua223@gmail.com> sure, i think we can get
>>> something sorted for the short-term.
>>>
>>> all we need is ssh access (i can provide an ssh key), and i can then
>>> have our jenkins master launch a remote worker on that instance.
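>>>
>>> (roughly, the kind of key wiring this needs; the user, paths, and host
>>> below are placeholders, not the real setup:)
>>>
>>>     # on the jenkins master: a dedicated key pair for the new agent
>>>     ssh-keygen -t ed25519 -f ~/.ssh/jenkins_arm_key -C jenkins-arm-agent
>>>
>>>     # on the arm instance: authorize that key for the build user
>>>     mkdir -p ~jenkins/.ssh
>>>     cat jenkins_arm_key.pub >> ~jenkins/.ssh/authorized_keys
>>>
>>>     # sanity check from the master before adding the jenkins node
>>>     ssh -i ~/.ssh/jenkins_arm_key jenkins@<arm-instance> 'java -version'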
>>>
>>> instance setup, etc, will be up to you.  my support for the time being
>>> will be to create the job and 'best effort' for everything else.
>>>
>>> this should get us up and running asap.
>>>
>>> is there an open JIRA for jenkins/arm test support?  we can move the
>>> technical details about this idea there.
>>>
>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua223@gmail.com>
>>> wrote:
>>>
>>>> @Sean Owen <srowen@gmail.com> , so sorry to reply late, we had a
>>>> Mid-Autumn holiday :)
>>>>
>>>> If you hope to integrate ARM CI into amplab jenkins, we can offer the ARM
>>>> instance, and then the ARM job will run together with the other x86 jobs.
>>>> Is there a guideline for doing this? @shane knapp
>>>> <sknapp@berkeley.edu> would you help us?
>>>>
>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>>> I don't know what's involved in actually accepting or operating those
>>>>> machines, so can't comment there, but in the meantime it's good that you
>>>>> are running these tests and can help report changes needed to keep it
>>>>> working with ARM. I would continue with that for now.
>>>>>
>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <
>>>>> huangtianhua223@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> For the whole workflow of Spark ARM CI, we want to make two things
>>>>>> clear.
>>>>>>
>>>>>> The first thing is:
>>>>>> For Spark ARM CI we now have two periodic jobs: one job[1] is based on
>>>>>> commit[2] (which already fixed the failed replay tests issue[3]; we made
>>>>>> a new test branch based on date 09-09-2019), and the other job[4] is
>>>>>> based on Spark master.
>>>>>>
>>>>>> With the first job we test on the specified branch, to prove that our
>>>>>> ARM CI is good and stable.
>>>>>> The second job checks Spark master every day, so we can find out whether
>>>>>> the latest commits affect the ARM CI. The build history and results show
>>>>>> that some problems are easier to find on ARM, like SPARK-28770
>>>>>> <https://issues.apache.org/jira/browse/SPARK-28770>, and also that we
>>>>>> make the effort to trace them and figure them out; so far we have found
>>>>>> and fixed several problems[5][6][7], thanks to everyone in the
>>>>>> community :). And we believe that ARM CI is very necessary, right?
>>>>>>
>>>>>> The second thing is:
>>>>>> We plan to run the jobs for a period of time, and you can see the
>>>>>> results and logs in the 'build history' of the jobs' consoles. If
>>>>>> everything goes well for one or two weeks, could the community accept
>>>>>> the ARM CI? Or how long would the periodic jobs need to run for the
>>>>>> community to have enough confidence to accept the ARM CI? As you
>>>>>> suggested before, it would be good to integrate ARM CI into amplab
>>>>>> jenkins; we agree, and we can donate the ARM instances and then maintain
>>>>>> the ARM-related test jobs together with the community. Any thoughts?
>>>>>>
>>>>>> Thank you all!
>>>>>>
>>>>>> [1]
>>>>>> http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>>> [2]
>>>>>> https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>>> [4]
>>>>>> http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <srowen@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I think it's just local caching. After you run the build you
>>>>>>> should find lots of stuff cached at ~/.m2/repository and it won't
>>>>>>> download every time.
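>>>>>>>
>>>>>>> (a minimal sketch of how that cache behaves; the flags assume a stock
>>>>>>> Maven install:)
>>>>>>>
>>>>>>>     # first build populates the local cache at ~/.m2/repository
>>>>>>>     mvn -DskipTests clean package
>>>>>>>     # or pre-fetch everything explicitly
>>>>>>>     mvn dependency:go-offline
>>>>>>>     # later builds can even run offline against the cache
>>>>>>>     mvn -o -DskipTests package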
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <
>>>>>>> bzhaojyathousandy@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sean,
>>>>>>>> Thanks for the reply. And apologies for making you confused.
>>>>>>>> I know the dependencies will be downloaded by SBT or Maven. But the
>>>>>>>> Spark QA job also executes "mvn clean package", so why doesn't the
>>>>>>>> log[1] print "downloading some jar from Maven central", and why does
>>>>>>>> it build so fast? Is the reason that Spark Jenkins builds the Spark
>>>>>>>> jars on physical machines and doesn't destroy the test env after a
>>>>>>>> job is finished? Then the next job that builds Spark will get the
>>>>>>>> dependency jars from the local cache, as the previous jobs that
>>>>>>>> executed "mvn package" had already downloaded those dependencies onto
>>>>>>>> the local worker machine. Am I right? Is that the reason the job
>>>>>>>> log[1] doesn't print any downloading information from Maven Central?
>>>>>>>>
>>>>>>>> Thank you very much.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> ZhaoBo
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 16, 2019 at 10:38 AM Sean Owen <srowen@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT
>>>>>>>>> and Maven like in any other project, and nothing about it is
>>>>>>>>> specific to Spark.
>>>>>>>>> The worker machines cache artifacts that are downloaded from these,
>>>>>>>>> but this is a function of Maven and SBT, not Spark. You may find
>>>>>>>>> that the initial download takes a long time.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <
>>>>>>>>> bzhaojyathousandy@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sean,
>>>>>>>>>>
>>>>>>>>>> Thanks very much for pointing out the roadmap. ;-). Then I think
>>>>>>>>>> we will continue to focus on our test environment.
>>>>>>>>>>
>>>>>>>>>> For the networking problems, I mean that we can access Maven
>>>>>>>>>> Central, and jobs can download the required jar packages at a high
>>>>>>>>>> network speed. What we want to know is why the Spark QA test
>>>>>>>>>> jobs'[1] logs show that the job script/maven build doesn't seem to
>>>>>>>>>> download the jar packages. Could you tell us the reason for that?
>>>>>>>>>> Thank you. The reason we raise the "networking problems" is a
>>>>>>>>>> phenomenon we found during our tests: if we execute "mvn clean
>>>>>>>>>> package" in a new test environment (in our test environment we
>>>>>>>>>> destroy the test VMs after the job finishes), maven will download
>>>>>>>>>> the dependency jar packages from Maven Central, but in the job
>>>>>>>>>> "spark-master-test-maven-hadoop" [2] we didn't find it download any
>>>>>>>>>> jar packages in the log; what is the reason for that?
>>>>>>>>>> Also, when we build the Spark jars and download dependencies from
>>>>>>>>>> Maven Central, it costs about 1 hour, while we found [2] costs just
>>>>>>>>>> 10 min. But if we run "mvn package" in a VM which already executed
>>>>>>>>>> "mvn package" before, it costs just 14 min, which looks very close
>>>>>>>>>> to [2]. So we suspect that downloading the jar packages costs most
>>>>>>>>>> of the time. For the goal of ARM CI, we expect the performance of
>>>>>>>>>> the new ARM CI to be close to the existing x86 CI, so users can
>>>>>>>>>> accept it more easily.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>> [2]
>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>>
>>>>>>>>>> ZhaoBo
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM Sean Owen <srowen@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If
>>>>>>>>>>> we set up CI/CD it will only tell us there are still some test
>>>>>>>>>>> failures. If it's stable, and not hard to add to the existing
>>>>>>>>>>> CI/CD, yes it could be done automatically later. You can continue
>>>>>>>>>>> to test on ARM independently for now.
>>>>>>>>>>>
>>>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>>>> test system if you're not able to download from Maven Central.
>>>>>>>>>>> That rarely takes significant time, and there aren't
>>>>>>>>>>> project-specific mirrors here. You might be able to point at a
>>>>>>>>>>> closer public mirror, depending on where you are.
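>>>>>>>>>>>
>>>>>>>>>>> (a minimal sketch of pointing Maven at a mirror; the mirror URL is
>>>>>>>>>>> a placeholder you'd replace with a real one:)
>>>>>>>>>>>
>>>>>>>>>>>     cat > mirror-settings.xml <<'EOF'
>>>>>>>>>>>     <settings>
>>>>>>>>>>>       <mirrors>
>>>>>>>>>>>         <mirror>
>>>>>>>>>>>           <id>nearby</id>
>>>>>>>>>>>           <mirrorOf>central</mirrorOf>
>>>>>>>>>>>           <url>https://your-nearby-mirror.example/maven2</url>
>>>>>>>>>>>         </mirror>
>>>>>>>>>>>       </mirrors>
>>>>>>>>>>>     </settings>
>>>>>>>>>>>     EOF
>>>>>>>>>>>     mvn -s mirror-settings.xml clean package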
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to discuss Spark ARM CI again. We ran some tests on an ARM
>>>>>>>>>>>> instance based on master; the jobs include
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and the k8s
>>>>>>>>>>>> integration https://github.com/theopenlab/spark/pull/17/ .
>>>>>>>>>>>> There are several things I want to talk about:
>>>>>>>>>>>>
>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>>     1. We have fixed some problems like
>>>>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>>>>> https://github.com/apache/spark/pull/25279; thanks to Sean Owen
>>>>>>>>>>>> and others for helping us.
>>>>>>>>>>>>     2. We tried the k8s integration test on ARM and met an error:
>>>>>>>>>>>> apk fetch hangs. The tests passed after adding the '--network
>>>>>>>>>>>> host' option to the `docker build` command, see
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176 ;
>>>>>>>>>>>> the solution refers to
>>>>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I don't
>>>>>>>>>>>> know whether it has ever happened in community CI; maybe we should
>>>>>>>>>>>> submit a PR to pass '--network host' to `docker build` (a command
>>>>>>>>>>>> sketch follows after this list)?
>>>>>>>>>>>>     3. We found two tests failed after the commit
>>>>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>>>        ReplayListenerSuite:
>>>>>>>>>>>>        - ...
>>>>>>>>>>>>        - End-to-end replay *** FAILED ***
>>>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>        - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>
>>>>>>>>>>>>        We tried to revert the commit and then the tests passed.
>>>>>>>>>>>> The patch is too big and, sorry, we can't find the reason till
>>>>>>>>>>>> now; if you are interested please try it, and it will be much
>>>>>>>>>>>> appreciated if someone can help us figure it out.
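>>>>>>>>>>>>
>>>>>>>>>>>> (the workaround from item 2 as a command sketch; the image tag is
>>>>>>>>>>>> just an example:)
>>>>>>>>>>>>
>>>>>>>>>>>>     # share the host's network stack so apk fetch doesn't hang
>>>>>>>>>>>>     docker build --network host -t spark-k8s-test .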
>>>>>>>>>>>>
>>>>>>>>>>>> Second, about the test time: we increased the flavor of the ARM
>>>>>>>>>>>> instance to 16U16G, but there seems to be no significant
>>>>>>>>>>>> improvement. The k8s integration test took about one and a half
>>>>>>>>>>>> hours, and the QA test (like the
>>>>>>>>>>>> spark-master-test-maven-hadoop-2.7 community jenkins job) took
>>>>>>>>>>>> about seventeen hours (it is too long :( ); we suspect the reason
>>>>>>>>>>>> is performance and network.
>>>>>>>>>>>> We split the jobs based on projects such as sql, core and so on,
>>>>>>>>>>>> and the time decreased to about seven hours, see
>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/19 . We found that the
>>>>>>>>>>>> Spark QA tests like
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>>>> seem to never download the jar packages from the maven central
>>>>>>>>>>>> repo (such as
>>>>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>>>>> So we want to know how the jenkins jobs do that; is there an
>>>>>>>>>>>> internal maven repo launched? Maybe we can do the same thing to
>>>>>>>>>>>> avoid the network cost of downloading the dependency jar packages
>>>>>>>>>>>> (a sketch of per-module builds follows below).
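>>>>>>>>>>>>
>>>>>>>>>>>> (how per-module builds might look with Maven's reactor flags; the
>>>>>>>>>>>> module path is an example:)
>>>>>>>>>>>>
>>>>>>>>>>>>     # build one module plus whatever it depends on, then test it
>>>>>>>>>>>>     ./build/mvn -pl sql/core -am -DskipTests package
>>>>>>>>>>>>     ./build/mvn -pl sql/core test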
>>>>>>>>>>>>
>>>>>>>>>>>> Third, the most important thing: it's about ARM CI of Spark. We
>>>>>>>>>>>> believe that it is necessary, right? And you can see we really
>>>>>>>>>>>> made a lot of efforts. Now the basic ARM build/test jobs are ok,
>>>>>>>>>>>> so we suggest adding ARM jobs to the community; we can set them
>>>>>>>>>>>> to non-voting at first, and improve/enrich the jobs step by step.
>>>>>>>>>>>> Generally, there are two ways in our mind to integrate the ARM CI
>>>>>>>>>>>> for Spark:
>>>>>>>>>>>>      1) We introduce openlab ARM CI into Spark as a custom CI
>>>>>>>>>>>> system. We provide human resources and test ARM VMs, and we will
>>>>>>>>>>>> also focus on the ARM-related issues about Spark. We will push
>>>>>>>>>>>> the PRs into the community.
>>>>>>>>>>>>      2) We donate ARM VM resources into the existing amplab
>>>>>>>>>>>> Jenkins. We still provide human resources, focus on the
>>>>>>>>>>>> ARM-related issues about Spark and push the PRs into the
>>>>>>>>>>>> community.
>>>>>>>>>>>> With both options we will provide human resources to maintain
>>>>>>>>>>>> the jobs; of course it will be great if we can work together. So
>>>>>>>>>>>> please tell us which option you would like, and let's move
>>>>>>>>>>>> forward. Waiting for your reply, thank you very much.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
