spark-dev mailing list archives

From Tianhua huang <huangtianhua...@gmail.com>
Subject Re: Ask for ARM CI for spark
Date Thu, 15 Aug 2019 10:43:33 GMT
Hi all,

I want to discuss Spark ARM CI again. We ran some tests on an ARM instance
based on master; the jobs include
https://github.com/theopenlab/spark/pull/13 and the k8s integration tests in
https://github.com/theopenlab/spark/pull/17/ . There are several things I
want to talk about:

First, about the failed tests:
    1. We have fixed some problems, such as
https://github.com/apache/spark/pull/25186 and
https://github.com/apache/spark/pull/25279. Thanks to Sean Owen and others
for helping us.
    2. We tried the k8s integration tests on ARM and hit an error: `apk fetch`
hangs. The tests passed after adding the '--network host' option to the
`docker build` command, see:

https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
. The solution follows
https://github.com/gliderlabs/docker-alpine/issues/307 . I don't know
whether this has ever happened in the community CI. Maybe we should submit
a PR to pass '--network host' to `docker build`?
    3. We found that two tests fail after the commit
https://github.com/apache/spark/pull/23767 :
       ReplayListenerSuite:
       - ...
       - End-to-end replay *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       - End-to-end replay with compression *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)

        When we reverted the commit, the tests passed. The patch is large
and, sorry, we have not found the cause yet. If you are interested, please
try it; we would very much appreciate it if someone could help us figure
it out.
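
For the `docker build` workaround in item 2, here is a minimal sketch of how
a build script could add the option. The helper name, its parameters and the
image tag are hypothetical and only for illustration; this is not the code
in the pull request linked above. The only real CLI pieces used are
`docker build`, `--network`, `-f` and `-t`.

```python
import shlex


def docker_build_command(image_tag, context_dir=".", dockerfile=None,
                         host_network=True):
    """Return the argv list for a `docker build` call.

    Hypothetical helper: when host_network is True, '--network host' is
    passed so that RUN steps (such as `apk fetch`) use the host network.
    """
    cmd = ["docker", "build"]
    if host_network:
        cmd += ["--network", "host"]
    if dockerfile:
        cmd += ["-f", dockerfile]
    cmd += ["-t", image_tag, context_dir]
    return cmd


if __name__ == "__main__":
    # 'spark-arm:test' is a made-up tag used only for this example.
    print(shlex.join(docker_build_command("spark-arm:test")))
```

The list form can be handed to `subprocess.run` directly, so the option can
be switched on only for the ARM job if it is not wanted elsewhere.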

Second, about the test time: we increased the flavor of the ARM instance to
16U16G, but there seemed to be no significant improvement. The k8s
integration test took about one and a half hours, and the QA test (like the
spark-master-test-maven-hadoop-2.7 community Jenkins job) took about
seventeen hours (it is too long :(). We suspect the cause is performance and
network.
When we split the jobs by project, such as sql, core and so on, the time
can be decreased to about seven hours, see
https://github.com/theopenlab/spark/pull/19 . We looked at the Spark QA tests
at https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ , and it
looks like those tests never download the jar packages from the Maven Central
repo (such as
https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
So we want to know how the Jenkins jobs do that: is there an internal
Maven repo? Maybe we can do the same thing to avoid the network
cost of downloading the dependency jars.
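
To make the idea of splitting by project concrete, here is a rough sketch
that builds one Maven test command per job. The job names, the grouping of
modules and the local repo path are assumptions for illustration, not what
the pull request above does. It also assumes the modules' dependencies were
already installed into the local Maven repo by an earlier build step. `-pl`
(select projects) and `-Dmaven.repo.local` (local repo location) are
standard Maven options.

```python
def maven_test_command(modules, mvn="build/mvn", local_repo=None):
    """Return the argv list that runs `test` for the given modules only."""
    cmd = [mvn]
    if local_repo:
        # Reuse a pre-populated local repo to avoid repeated jar downloads.
        cmd.append(f"-Dmaven.repo.local={local_repo}")
    cmd += ["-pl", ",".join(modules), "test"]
    return cmd


# Hypothetical grouping of modules into jobs, for illustration only.
JOB_MODULES = {
    "core": ["core"],
    "sql": ["sql/catalyst", "sql/core"],
}


def plan_jobs(job_modules, local_repo=None):
    """Map each job name to the command it would run."""
    return {job: maven_test_command(mods, local_repo=local_repo)
            for job, mods in job_modules.items()}


if __name__ == "__main__":
    for job, cmd in plan_jobs(JOB_MODULES, local_repo="/cache/m2").items():
        print(job, " ".join(cmd))
```

Jobs planned this way can run in parallel on separate instances, and a
shared, pre-filled local repo is one possible way to keep the downloads off
the critical path.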

Third, and most important, is the ARM CI for Spark itself. We believe
it is necessary, right? You can see we have put in a lot of effort, and
the basic ARM build/test jobs now work, so we suggest adding ARM jobs to
the community CI. We can make them non-voting at first, and improve and
enrich the jobs step by step. We see two ways to integrate ARM CI for
Spark:
     1) We introduce OpenLab ARM CI into Spark as a custom CI system. We
provide human resources and ARM test VMs, and we will focus on the
ARM-related issues in Spark. We will push the PRs to the community.
     2) We donate ARM VM resources to the existing AMPLab Jenkins. We still
provide human resources, focus on the ARM-related issues in Spark and
push the PRs to the community.
With either option, we will provide people for maintenance, and of course it
will be great if we can work together. So please tell us which option you
prefer, and let's move forward. Waiting for your reply, thank you very
much.

On Wed, Aug 14, 2019 at 10:30 AM Tianhua huang <huangtianhua223@gmail.com>
wrote:

> OK, thanks.
>
> On Tue, Aug 13, 2019 at 8:37 PM Sean Owen <srowen@gmail.com> wrote:
>
>> -dev@ -- it's better not to send to the whole list to discuss specific
>> changes or issues from here. You can reply on the pull request.
>> I don't know what the issue is either at a glance.
>>
>> On Tue, Aug 13, 2019 at 2:54 AM Tianhua huang <huangtianhua223@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> About the arm test of spark, recently we found two tests failed after
>>> the commit https://github.com/apache/spark/pull/23767:
>>>        ReplayListenerSuite:
>>>        - ...
>>>        - End-to-end replay *** FAILED ***
>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>        - End-to-end replay with compression *** FAILED ***
>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>
>>> When we reverted the commit, the tests passed. The patch is large
>>> and, sorry, we have not found the cause yet. If you are interested,
>>> please try it; we would very much appreciate it if someone could help
>>> us figure it out.
>>>
>>> On Tue, Aug 6, 2019 at 9:08 AM bo zhaobo <bzhaojyathousandy@gmail.com>
>>> wrote:
>>>
>>>> Hi shane,
>>>> Thanks for your reply. I will wait for you back. ;-)
>>>>
>>>> Thanks,
>>>> Best regards
>>>> ZhaoBo
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 2, 2019 at 10:41 PM shane knapp <sknapp@berkeley.edu> wrote:
>>>>
>>>>> i'm out of town, but will answer some of your questions next week.
>>>>>
>>>>> On Fri, Aug 2, 2019 at 2:39 AM bo zhaobo <bzhaojyathousandy@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> Any updates about the CI details? ;-)
>>>>>>
>>>>>> Also, I need your kind help with the Spark QA tests: could
>>>>>> anyone tell us how those tests are triggered? When? How? So far, I
>>>>>> haven't figured out how it works.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> ZhaoBo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 31, 2019 at 11:56 AM bo zhaobo <bzhaojyathousandy@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, team.
>>>>>>> I want to run the same tests on ARM as the existing CI does (x86).
>>>>>>> As building and testing the whole Spark project takes too long, I
>>>>>>> plan to split it into multiple jobs to reduce the time cost. But I
>>>>>>> cannot see what the existing CI[1] does (it calls so many private
>>>>>>> scripts), so could any CI maintainers tell us how to split the jobs
>>>>>>> and what the different CI jobs do? For example, PR titles contain
>>>>>>> [SQL], [INFRA], [ML], [DOC], [CORE], [PYTHON], [k8s], [DSTREAMS],
>>>>>>> [MLlib], [SCHEDULER], [SS], [YARN], [BUILD] and so on, and each of
>>>>>>> them seems to run a different CI job.
>>>>>>>
>>>>>>> @shane knapp,
>>>>>>> Sorry to disturb you. Your email address appears to be from
>>>>>>> 'berkeley.edu'; are you the right person to ask for help
>>>>>>> with this? ;-)
>>>>>>> If so, could you give us some help or advice? Thank you.
>>>>>>>
>>>>>>> Thank you very much,
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> ZhaoBo
>>>>>>>
>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 9:38 AM Tianhua huang <huangtianhua223@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Sean Owen <srowen@gmail.com>  Thank you very much. I saw your
>>>>>>>> comment in https://issues.apache.org/jira/browse/SPARK-28519.
>>>>>>>> I will test with the modification to see whether other similar
>>>>>>>> tests fail, and will address them together in one pull request.
>>>>>>>>
>>>>>>>> On Sat, Jul 27, 2019 at 9:04 PM Sean Owen <srowen@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Great thanks - we can take this to JIRAs now.
>>>>>>>>> I think it's worth changing the implementation of atanh if the
>>>>>>>>> test value just reflects what Spark does, and there's evidence
>>>>>>>>> it is a little bit inaccurate.
>>>>>>>>> There's an equivalent formula which seems to have better accuracy.
>>>>>>>>>
>>>>>>>>> On Fri, Jul 26, 2019 at 10:02 PM Takeshi Yamamuro <
>>>>>>>>> linguin.m.s@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, all,
>>>>>>>>>>
>>>>>>>>>> FYI:
>>>>>>>>>> >> @Yuming Wang the results in float8.sql are from PostgreSQL
>>>>>>>>>> >> directly?
>>>>>>>>>> >> Interesting if it also returns the same less accurate result,
>>>>>>>>>> >> which might suggest it's more to do with underlying OS math
>>>>>>>>>> >> libraries. You noted that these tests sometimes gave
>>>>>>>>>> >> platform-dependent differences in the last digit, so wondering
>>>>>>>>>> >> if the test value directly reflects PostgreSQL or just what we
>>>>>>>>>> >> happen to return now.
>>>>>>>>>>
>>>>>>>>>> The results in float8.sql.out were recomputed in Spark/JVM.
>>>>>>>>>> The expected output of the PostgreSQL test is here:
>>>>>>>>>> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L493
>>>>>>>>>>
>>>>>>>>>> As you can see in the file (float8.out), the results other than
>>>>>>>>>> atanh are also different between Spark/JVM and PostgreSQL.
>>>>>>>>>> For example, the answers of acosh are:
>>>>>>>>>> -- PostgreSQL
>>>>>>>>>>
>>>>>>>>>> https://github.com/postgres/postgres/blob/master/src/test/regress/expected/float8.out#L487
>>>>>>>>>> 1.31695789692482
>>>>>>>>>>
>>>>>>>>>> -- Spark/JVM
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/pgSQL/float8.sql.out#L523
>>>>>>>>>> 1.3169578969248166
>>>>>>>>>>
>>>>>>>>>> btw, the PostgreSQL implementation for atanh just calls atanh in
>>>>>>>>>> math.h:
>>>>>>>>>>
>>>>>>>>>> https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L2606
>>>>>>>>>>
>>>>>>>>>> Bests,
>>>>>>>>>> Takeshi
>>>>>>>>>>
>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
