spark-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Ask for ARM CI for spark
Date Fri, 26 Jul 2019 09:46:06 GMT
Interesting. I don't think log(3) is special; it's just that differences
in how it's implemented for floating point on aarch64 vs x86, or in the
JVM, manifest at certain values like this. It's
still a little surprising! BTW Wolfram Alpha suggests that the correct
value is more like ...810969..., right between the two. java.lang.Math
doesn't guarantee strict IEEE floating-point behavior, but
java.lang.StrictMath is supposed to, at the potential cost of speed,
and it gives ...81096, in agreement with aarch64.
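A quick way to check this in the REPL (the x86_64 value is the one
reported below in this thread; StrictMath uses fdlibm and should be
identical everywhere):

scala> java.lang.Math.log(3.0)       // 1.0986122886681098 on x86_64, per this thread
scala> java.lang.StrictMath.log(3.0) // 1.0986122886681096, matching aarch64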

@Yuming Wang the results in float8.sql are from PostgreSQL directly?
Interesting if it also returns the same less accurate result, which
might suggest it's more to do with underlying OS math libraries. You
noted that these tests sometimes gave platform-dependent differences
in the last digit, so I'm wondering if the test value directly reflects
PostgreSQL or just what we happen to return now.

One option is to use StrictMath in special cases like computing atanh.
That gives a value that agrees with aarch64.
I also note that 0.5 * (math.log(1 + x) - math.log(1 - x)) gives the
more accurate answer too, and makes the result agree with, say,
Wolfram Alpha for atanh(0.5).
(Actually if we do that, better still is 0.5 * (math.log1p(x) -
math.log1p(-x)) for best accuracy near 0)
Commons Math also has implementations of sinh, cosh, atanh that we
could call. It claims it's possibly more accurate and faster. I
haven't tested its result here.
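To make the candidates concrete, here is a rough sketch of the variants
discussed above for atanh(0.5); the comments reflect the behavior
reported in this thread:

val x = 0.5
val current  = 0.5 * math.log((1.0 + x) / (1.0 - x))          // form in the repro below; last digit is platform-dependent
val diffLogs = 0.5 * (math.log(1.0 + x) - math.log(1.0 - x))  // difference-of-logs form
val viaLog1p = 0.5 * (math.log1p(x) - math.log1p(-x))         // log1p form; best accuracy near 0
val strict   = 0.5 * StrictMath.log((1.0 + x) / (1.0 - x))    // StrictMath (fdlibm), reproducible across platforms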

FWIW the "log1p" version appears, from some informal testing, to be
most accurate (in agreement with Wolfram) and using StrictMath doesn't
matter. If we change something, I'd use that version above.
The only issue is if this causes the result to disagree with
PostgreSQL, but then again it's more correct and maybe the DB is
wrong.


The rest may be a test vs PostgreSQL issue; see
https://issues.apache.org/jira/browse/SPARK-28316


On Fri, Jul 26, 2019 at 2:32 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>
> Hi, all
>
>
> Sorry to disturb again; several SQL tests failed on our arm64 instance:
>
> pgSQL/float8.sql *** FAILED ***
> Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result did not match
for query #56
> SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
> pgSQL/numeric.sql *** FAILED ***
> Expected "2 2247902679199174[72 224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595840 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051856] 107511333880052007....", but got "2 2247902679199174[40 224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595580 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051872] 107511333880052007...." Result did not match for query #496
> SELECT t1.id1, t1.result, t2.expected
> FROM num_result t1, num_exp_power_10_ln t2
> WHERE t1.id1 = t2.id
> AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
>
> The first test failed, because the value of math.log(3.0) is different on aarch64:
>
> # on x86_64:
>
> scala> val a = 0.5
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
> res1: Double = 0.5493061443340549
>
> scala> math.log((1.0 + a) / (1.0 - a))
> res2: Double = 1.0986122886681098
>
> # on aarch64:
>
> scala> val a = 0.5
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
> res20: Double = 0.5493061443340548
>
> scala> math.log((1.0 + a) / (1.0 - a))
> res21: Double = 1.0986122886681096
>
> And I tried several other numbers, like math.log(4.0) and math.log(5.0),
> and they are the same; I don't know why math.log(3.0) is so special, but
> the result is indeed different on aarch64. If you are interested, please
> try it.
>
> The second test failed because some values of pow(10, x) are different on
> aarch64. Based on the SQL tests of Spark, I ran similar tests on aarch64
> and x86_64; take '-83028485' as an example:
>
> # on x86_64:
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double = 1.71669957511859584E18
>
> # on aarch64:
>
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
> scala> math.log(abs(a))
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
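>
> As a cross-platform sanity check (just a sketch, untested here),
> StrictMath should give bit-identical results on both architectures:
>
> scala> StrictMath.pow(10, StrictMath.log(abs(a)))  // should match on x86_64 and aarch64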
>
> I sent an email to jdk-dev hoping someone can help, and I also filed this
> as https://issues.apache.org/jira/browse/SPARK-28519. If you are
> interested, welcome to join the discussion. Thank you very much.
>
>
> On Thu, Jul 18, 2019 at 11:12 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>
>> Thanks for your reply.
>>
>> About the first problem, we didn't find any other reason in the logs, just
>> a timeout waiting for the executors to come up. After increasing the
>> timeout from 10000 ms to 30000 ms (even 20000 ms),
>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L764
>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L792
>> the tests passed and more than one executor came up. We're not sure whether
>> this is related to the flavor of our aarch64 instance; currently it is
>> 8C8G, and maybe we will try a bigger flavor later. If anyone has other
>> suggestions, please contact me, thank you.
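>>
>> For reference, the change is roughly of this shape (a sketch; the helper
>> name TestUtils.waitUntilExecutorsUp and the call site are assumptions,
>> not the exact source):
>>
>> // was 10000 ms; our 8C8G aarch64 instance needs longer for executors to register
>> TestUtils.waitUntilExecutorsUp(sc, 2, 30000)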
>>
>> About the second problem, I opened a pull request against apache/spark,
>> https://github.com/apache/spark/pull/25186. If you have time, would you
>> please help review it? Thank you very much.
>>
>> On Wed, Jul 17, 2019 at 8:37 PM Sean Owen <srowen@gmail.com> wrote:
>>>
>>> On Wed, Jul 17, 2019 at 6:28 AM Tianhua huang <huangtianhua223@gmail.com> wrote:
>>> > Two failed, and the reason is 'Can't find 1 executors before 10000
>>> > milliseconds elapsed' (see below). After we increased the timeout the
>>> > tests passed, so we wonder if we can increase the timeout. I also have
>>> > another question about
>>> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TestUtils.scala#L285:
>>> > why is it not >=? Judging by the comment on the function, it should be >=?
>>> >
>>>
>>> I think it's ">" because the driver is also an executor, but not 100%
>>> sure. In any event it passes in general.
>>> These errors typically mean "I didn't start successfully" for some
>>> other reason that may be in the logs.
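>>> For illustration, the check is roughly of this shape (paraphrased, not
>>> the exact source):
>>>
>>> // the driver registers as an executor too, so strictly greater-than
>>> // means the requested number of worker executors are up
>>> sc.statusTracker.getExecutorInfos.length > numExecutors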
>>>
>>> > The other two failed, and the reason is '2143289344 equaled 2143289344'.
>>> > This is because the value of floatToRawIntBits(0.0f/0.0f) on the aarch64
>>> > platform is 2143289344, which equals floatToRawIntBits(Float.NaN). About
>>> > this I sent an email to jdk-dev and opened a topic on the Scala community
>>> > forum, https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845,
>>> > and https://github.com/scala/bug/issues/11632. I thought it was something
>>> > about the JDK or Scala, but after discussion it appears to be
>>> > platform-related, so it seems the following asserts are not appropriate:
>>> > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala#L704-L705
>>> > and https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala#L732-L733
>>>
>>> These tests could special-case execution on ARM, like you'll see some
>>> tests handle big-endian architectures.
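>>> For example, a sketch of such a special case (hypothetical, not existing
>>> Spark code; the assert shape is inferred from the failure message above):
>>>
>>> import java.lang.Float.floatToRawIntBits
>>> // x86 produces a non-canonical NaN for 0.0f/0.0f while aarch64 produces
>>> // the canonical one, so only assert the raw-bits inequality off aarch64
>>> if (System.getProperty("os.arch") != "aarch64") {
>>>   assert(floatToRawIntBits(0.0f / 0.0f) != floatToRawIntBits(Float.NaN))
>>> }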


