spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AssafMendelson <assaf.mendel...@rsa.com>
Subject UDF/UDAF performance
Date Sun, 28 Aug 2016 09:53:56 GMT
I am trying to do a high performance calculations which require custom functions.

As a first stage I am trying to profile the effect of using UDF and I am getting weird results.

I created a simple test (in https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6100413009427980/1838590056414286/3882373005101896/latest.html)

The basic process is as follows:
- I create a dataframe using the range option with 50M records and cache it.
- I then do a filter to find those smaller than 10 and count them. Once by doing column <
10 and once by doing it via UDF.
- I ran each action 10 times to get a good time estimate.

I then tried to run this test on 3 different systems:

-          On databricks system both methods took around the same time: ~4 seconds.

-          On the first on premise cluster (8 nodes, using yarn, each node with ~40GB memory
and plenty of cores): I got a result of 1 second for the first option (no udf) and 8 for the
second (using udf).

-          On the second on premise cluster (2 nodes, using yarn, each node with ~16GB memory
and few cores): I got 1 second for the first option (no udf) and 16 seconds for the second
(using UDF)

I am trying to understand the big differences in performance (I rerun the test several times
on each of the systems and got relatively consistent results).
Thanks,
Assaf.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/UDF-UDAF-performance-tp27612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Mime
View raw message