spark-dev mailing list archives

From Tianhua huang <huangtianhua...@gmail.com>
Subject Re: Ask for ARM CI for spark
Date Wed, 17 Jul 2019 10:27:31 GMT
Hi all,

We ran all of the Spark unit tests on an arm64 platform. After some effort,
four tests still FAILED; see
https://logs.openlabtesting.org/logs/4/4/ae5ebaddd6ba6eba5a525b2bf757043ebbe78432/check/spark-build-arm64/9ecccad/job-output.txt.gz

Two of them failed with "Can't find 1 executors before 10000 milliseconds
elapsed" (see below). When we increased the timeout the tests passed, so we
wonder whether the timeout can be increased. I also have a question about
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TestUtils.scala#L285:
why is the comparison not >=? According to the comment on the function, it
seems it should be >=.

- test driver discovery under local-cluster mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
  at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
  at org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
  at org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
  at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
  at org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

- test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
  at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
  at org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
  at org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
  at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
  at org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
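
To make the timeout and >= question concrete, here is a minimal sketch in
Java of the polling pattern as I understand it. It is not the Spark code:
WaitSketch, waitUntilUp and the supplier argument are hypothetical names used
only for illustration. The timeout is a parameter, so a slower machine could
pass a larger value.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

// Hypothetical illustration only; not the code in TestUtils.scala.
final class WaitSketch {

    // Polls knownCount until it is strictly greater than expected,
    // or throws once timeoutMs milliseconds have elapsed.
    static void waitUntilUp(IntSupplier knownCount, int expected, long timeoutMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        // Compare via subtraction so the check stays correct if nanoTime wraps.
        while (System.nanoTime() - deadline < 0) {
            // Strict '>' as I read the line asked about; with '>=' the loop
            // would already return when knownCount equals expected.
            if (knownCount.getAsInt() > expected) {
                return;
            }
            Thread.sleep(10);
        }
        throw new TimeoutException(
            "Can't find " + expected + " executors before " + timeoutMs
                + " milliseconds elapsed");
    }

    public static void main(String[] args) throws Exception {
        // Returns at the first poll because 2 > 1.
        waitUntilUp(() -> 2, 1, 10_000L);
        try {
            // With strict '>' a count equal to expected never satisfies the check.
            waitUntilUp(() -> 1, 1, 100L);
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a strict comparison, a count that only ever equals the expected number
always runs into the timeout, which is why the choice between > and >=
matters for these two tests.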

The other two failed with "2143289344 equaled 2143289344". This is because
on the aarch64 platform the value of floatToRawIntBits(0.0f/0.0f) is
2143289344, which equals floatToRawIntBits(Float.NaN). I sent an email about
this to jdk-dev and opened a topic in the Scala community:
https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845
and https://github.com/scala/bug/issues/11632. I thought it was a JDK or
Scala issue, but after discussion it appears to be related to the platform,
so it seems the following asserts are not appropriate?
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala#L704-L705
and https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala#L732-L733

 - SPARK-26021: NaN and -0.0 in grouping expressions *** FAILED ***
   2143289344 equaled 2143289344 (DataFrameAggregateSuite.scala:732)
 - NaN and -0.0 in window partition keys *** FAILED ***
   2143289344 equaled 2143289344 (DataFrameWindowFunctionsSuite.scala:704)
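
To illustrate the NaN point, here is a minimal Java sketch. NanBitsSketch and
its method names are hypothetical and are not taken from the Spark suites. It
builds two quiet NaNs from explicit bit patterns (0x7fc00000, which is
2143289344, and 0xffc00000) instead of depending on what 0.0f/0.0f produces on
a given platform, and it shows that Float.floatToIntBits collapses every NaN
to the canonical pattern while floatToRawIntBits does not. Whether the tests
should be changed this way is only a suggestion.

```java
// Hypothetical illustration only; not code from the Spark test suites.
final class NanBitsSketch {

    // 0x7fc00000 == 2143289344, the bit pattern of Float.NaN.
    static final int CANONICAL_NAN_BITS = 0x7fc00000;
    // The same quiet NaN with the sign bit set.
    static final int SIGNED_NAN_BITS = 0xffc00000;

    // True when the two floats have different raw bit patterns.
    static boolean rawBitsDiffer(float a, float b) {
        return Float.floatToRawIntBits(a) != Float.floatToRawIntBits(b);
    }

    // floatToIntBits maps every NaN to the canonical pattern before comparing.
    static boolean sameAfterCanonicalization(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }

    public static void main(String[] args) {
        float nan1 = Float.intBitsToFloat(CANONICAL_NAN_BITS);
        float nan2 = Float.intBitsToFloat(SIGNED_NAN_BITS);
        System.out.println("raw bits differ: " + rawBitsDiffer(nan1, nan2));
        System.out.println("same after canonicalization: "
            + sameAfterCanonicalization(nan1, nan2));

        // The NaN produced by dividing at run time depends on the platform,
        // so no particular value is assumed here.
        float zero = Float.parseFloat("0.0");
        System.out.println("0.0f/0.0f raw bits: " + Float.floatToRawIntBits(zero / zero));
    }
}
```

Constructing the two NaN values from fixed bit patterns keeps the
precondition of the test independent of the hardware's default NaN.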

We are waiting for your suggestions on how to fix the failed tests.
Thank you very much.


On Wed, Jul 10, 2019 at 10:07 AM Tianhua huang <huangtianhua223@gmail.com>
wrote:

> Hi all,
>
> I am glad to tell you there is new progress on building and testing Spark
> on an aarch64 server: the tests are running. See the detailed build/test
> log at
> https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/job-output.txt.gz
> and the aarch64 instance info at
> https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/zuul-info/zuul-info.ubuntu-xenial-arm64.txt
> To enable the tests I made some modifications. The major one is building
> a local leveldbjni package: I forked the fusesource/leveldbjni and
> chirino/leveldb repos and modified them so that the local package builds
> (see https://github.com/huangtianhua/leveldbjni/pull/1 and
> https://github.com/huangtianhua/leveldbjni/pull/2), then used it in
> Spark. The details are in
> https://github.com/theopenlab/spark/pull/1
>
> The tests do not all pass yet. I will try to fix them, and any
> suggestions are welcome. Thank you all.
>
> On Mon, Jul 1, 2019 at 5:25 PM Tianhua huang <huangtianhua223@gmail.com>
> wrote:
>
>> We are focused on ARM instances in the cloud. I currently use an ARM
>> instance on the vexxhost cloud to run the build job mentioned above; the
>> instance has 8 vCPUs and 8 GB of RAM, and we can use a bigger flavor to
>> create the ARM instance for the job if need be.
>>
>> On Fri, Jun 28, 2019 at 6:55 PM Steve Loughran
>> <stevel@cloudera.com.invalid> wrote:
>>
>>>
>>> Be interesting to see how well a Pi4 works; with only 4GB of RAM you
>>> wouldn't compile with it, but you could try installing the spark jar bundle
>>> and then run against some NFS mounted disks:
>>> https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/ ;
>>> unlikely to be fast, but it'd be an efficient kind of slow
>>>
>>> On Fri, Jun 28, 2019 at 3:08 AM Rui Chen <chenrui.momo@gmail.com> wrote:
>>>
>>>> >  I think any AA64 work is going to have to define very clearly what
>>>> "works" is defined as
>>>>
>>>> +1
>>>> It's very valuable to establish a clear scope of these projects'
>>>> functionality on the ARM platform in the upstream community; it gives
>>>> end users and customers confidence when they plan to deploy these
>>>> projects on ARM.
>>>>
>>>> This is definitely long-term work. Let's take it step by step: CI,
>>>> testing, reporting issues, and resolving them.
>>>>
>>>> On Thu, Jun 27, 2019 at 9:22 PM Steve Loughran <stevel@cloudera.com.invalid> wrote:
>>>>
>>>>> level db and native codecs are invariably a problem here, as is
>>>>> anything else doing misaligned IO. Protobuf has also had "issues" in
>>>>> the past
>>>>>
>>>>> see https://issues.apache.org/jira/browse/HADOOP-16100
>>>>>
>>>>> I think any AA64 work is going to have to define very clearly what
>>>>> "works" is defined as; spark standalone with a specific set of codecs
>>>>> is probably the first thing to aim for -no Snappy or lz4.
>>>>>
>>>>> Anything which goes near: protobuf, checksums, native code, etc is in
>>>>> trouble. Don't try and deploy with HDFS as the cluster FS, would be my
>>>>> recommendation.
>>>>>
>>>>> If you want a cluster use NFS or one of google GCS, Azure WASB for the
>>>>> cluster FS. And before trying either of those cloud store, run the
>>>>> filesystem connector test suites (hadoop-azure; google gcs github) to
>>>>> see that they work. If the foundational FS test suites fail, nothing
>>>>> else will work
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 27, 2019 at 3:09 AM Tianhua huang <
>>>>> huangtianhua223@gmail.com> wrote:
>>>>>
>>>>>> I ran the unit tests on my ARM instance earlier and reported an issue
>>>>>> in https://issues.apache.org/jira/browse/SPARK-27721. It seems there
>>>>>> is no leveldbjni native package for aarch64 in leveldbjni-all.jar
>>>>>> (or 1.8):
>>>>>> https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8
>>>>>> The PR https://github.com/fusesource/leveldbjni/pull/82 added aarch64
>>>>>> support and was merged on 2 Nov 2017, but the latest release of the
>>>>>> repo is from 17 Oct 2013, so unfortunately it does not include the
>>>>>> aarch64 support.
>>>>>>
>>>>>> I will run the tests in the job mentioned above and try to fix the
>>>>>> issue. If anyone has any ideas about it, replies are welcome.
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 26, 2019 at 8:11 PM Sean Owen <srowen@gmail.com> wrote:
>>>>>>
>>>>>>> Can you begin by testing yourself? I think the first step is to make
>>>>>>> sure the build and tests work on ARM. If you find problems you can
>>>>>>> isolate them and try to fix them, or at least report them. It's only
>>>>>>> worth getting CI in place when we think builds will work.
>>>>>>>
>>>>>>> On Tue, Jun 25, 2019 at 9:26 PM Tianhua huang <
>>>>>>> huangtianhua223@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Thanks Shane :)
>>>>>>> >
>>>>>>> > This sounds good, and yes, I agree that it's best to keep the
>>>>>>> > test/build infrastructure in one place. If you can't find ARM
>>>>>>> > resources, we are willing to provide the ARM instance :)  Our goal
>>>>>>> > is to make more open source software compatible with the aarch64
>>>>>>> > platform, so let's do it. I will be happy if I can help toward
>>>>>>> > that goal.
>>>>>>> >
>>>>>>> > Waiting for your good news :)
>>>>>>> >
>>>>>>> > On Wed, Jun 26, 2019 at 9:47 AM shane knapp <sknapp@berkeley.edu> wrote:
>>>>>>> >>
>>>>>>> >> ...or via VM as you mentioned earlier.  :)
>>>>>>> >>
>>>>>>> >> shane (who will file a JIRA tomorrow)
>>>>>>> >>
>>>>>>> >> On Tue, Jun 25, 2019 at 6:44 PM shane knapp <sknapp@berkeley.edu> wrote:
>>>>>>> >>>
>>>>>>> >>> i'd much prefer that we keep the test/build infrastructure in
>>>>>>> >>> one place.
>>>>>>> >>>
>>>>>>> >>> we don't have ARM hardware, but there's a slim possibility i can
>>>>>>> >>> scare something up in our older research stock...
>>>>>>> >>>
>>>>>>> >>> another option would be to run the build in a arm-based docker
>>>>>>> >>> container, which (according to the intarwebs) is possible.
>>>>>>> >>>
>>>>>>> >>> shane
>>>>>>> >>>
>>>>>>> >>> On Tue, Jun 25, 2019 at 6:35 PM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> I forked the apache/spark project and proposed a job
>>>>>>> >>>> (https://github.com/theopenlab/spark/pull/1) for building Spark
>>>>>>> >>>> on an OpenLab ARM instance. This is the first step toward
>>>>>>> >>>> building Spark on ARM. I can enable a periodic ARM build job for
>>>>>>> >>>> apache/spark master if you like. Later I will run the tests for
>>>>>>> >>>> Spark. I am also willing to be the maintainer of the ARM CI for
>>>>>>> >>>> Spark.
>>>>>>> >>>>
>>>>>>> >>>> Thanks for your attention.
>>>>>>> >>>>
>>>>>>> >>>> On Thu, Jun 20, 2019 at 10:17 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks Sean.
>>>>>>> >>>>>
>>>>>>> >>>>> I am very happy to hear that the community will put effort into
>>>>>>> >>>>> fixing the ARM-related issues. I'd be happy to help if you like.
>>>>>>> >>>>> Could you also give the tracking link for this issue, so that I
>>>>>>> >>>>> can check whether it is fixed? Thank you.
>>>>>>> >>>>> As far as I know, the old versions of Spark supported ARM and
>>>>>>> >>>>> the new versions don't. This shows that we need a CI to check
>>>>>>> >>>>> whether Spark supports ARM and whether some modification breaks
>>>>>>> >>>>> it.
>>>>>>> >>>>> I will add a demo job in OpenLab to build Spark on ARM and run a
>>>>>>> >>>>> simple unit test. Later I will share the job link.
>>>>>>> >>>>>
>>>>>>> >>>>> Let me know what you think.
>>>>>>> >>>>>
>>>>>>> >>>>> Thank you all!
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Wed, Jun 19, 2019 at 8:47 PM Sean Owen <srowen@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> I'd begin by reporting and fixing ARM-related issues in the build. If
>>>>>>> >>>>>> they're small, of course we should do them. If it requires significant
>>>>>>> >>>>>> modifications, we can discuss how much Spark can support ARM. I don't
>>>>>>> >>>>>> think it's yet necessary for the Spark project to run these CI builds
>>>>>>> >>>>>> until that point, but it's always welcome if people are testing that
>>>>>>> >>>>>> separately.
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Wed, Jun 19, 2019 at 7:41 AM Holden Karau <holden@pigscanfly.ca> wrote:
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > Moving to dev@ for increased visibility among the developers.
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > On Wed, Jun 19, 2019 at 1:24 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> Thanks for your reply.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> As I said before, I ran into some problems building and
>>>>>>> >>>>>> >> testing Spark on an aarch64 server, so it would be better to
>>>>>>> >>>>>> >> have ARM CI to make sure Spark is compatible with AArch64
>>>>>>> >>>>>> >> platforms.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> I'm from the OpenLab team (https://openlabtesting.org/), a
>>>>>>> >>>>>> >> community that does open source project testing. We can
>>>>>>> >>>>>> >> provide some ARM virtual machines to AMPLab Jenkins, and we
>>>>>>> >>>>>> >> also have a developer team that is willing to work on this:
>>>>>>> >>>>>> >> we are willing to maintain the build CI jobs and address the
>>>>>>> >>>>>> >> CI issues.  What do you think?
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> Thanks for your attention.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> On Wed, Jun 19, 2019 at 6:39 AM shane knapp <sknapp@berkeley.edu> wrote:
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> yeah, we don't have any aarch64 systems for testing... this
>>>>>>> >>>>>> >>> has been asked before but is currently pretty low on our
>>>>>>> >>>>>> >>> priority list as we don't have the hardware.
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> sorry,
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> shane
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> On Mon, Jun 10, 2019 at 7:08 PM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>>>>>> >>>>>> >>>>
>>>>>>> >>>>>> >>>> Hi, sorry to disturb you.
>>>>>>> >>>>>> >>>> The CI testing for Apache Spark is supported by AMPLab
>>>>>>> >>>>>> >>>> Jenkins, and I see there are some machines (most of them
>>>>>>> >>>>>> >>>> Linux (amd64)) for CI, but there seems to be no AArch64
>>>>>>> >>>>>> >>>> machine for Spark CI testing. Recently I built and ran the
>>>>>>> >>>>>> >>>> tests for Spark (master and branch-2.4) on my ARM server,
>>>>>>> >>>>>> >>>> and unfortunately there were some problems; for example, a
>>>>>>> >>>>>> >>>> unit test failed due to the LEVELDBJNI native package. For
>>>>>>> >>>>>> >>>> details of the Java tests see
>>>>>>> >>>>>> >>>> http://paste.openstack.org/show/752063/
>>>>>>> >>>>>> >>>> and for the Python tests see
>>>>>>> >>>>>> >>>> http://paste.openstack.org/show/752709/
>>>>>>> >>>>>> >>>> So I have a question about ARM CI testing for Spark: is
>>>>>>> >>>>>> >>>> there any plan to support it? Thank you very much; I will
>>>>>>> >>>>>> >>>> wait for your reply!
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> --
>>>>>>> >>>>>> >>> Shane Knapp
>>>>>>> >>>>>> >>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >>>>>> >>> https://rise.cs.berkeley.edu
>>>>>>> >>>>>> >
>>>>>>> >>>>>> >
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > --
>>>>>>> >>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>>>> >>>>>> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>> >>>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Shane Knapp
>>>>>>> >>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >>> https://rise.cs.berkeley.edu
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Shane Knapp
>>>>>>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >> https://rise.cs.berkeley.edu
>>>>>>>
>>>>>>
