spark-dev mailing list archives

From Adam Roberts <>
Subject Re: [VOTE] Apache Spark 2.1.0 (RC5)
Date Sun, 18 Dec 2016 20:33:04 GMT
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest 
SDK for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16.04, 14.04, SUSE 12, CentOS 7.2 on x86 and 
IBM-specific platforms including big-endian. On slower machines I see these 
failing but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails 
with informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_date, complete mode

Performance vs 2.0.2: lots of improvements seen using the HiBench and 
SparkSqlPerf benchmarks, tested on a 48-core Intel machine using the 
Kryo serializer in a controlled test environment. These are all open source 
benchmarks anyone can use and experiment with. Elapsed times were measured: 
a + score is an improvement (that many percent faster) and a - score is a 
regression.

K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 seconds to 
24 seconds), Python API unchanged
PageRank: minor improvement from 40 seconds to 38 seconds, +5%
Sort: minor improvement, 10.8 seconds to 9.8 seconds, +10%
WordCount: unchanged
Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is -47%, 
other times marginally faster by 15%, something to keep an eye on
Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs
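
As a rough illustration of how the percentages above relate to the elapsed times, here is a minimal Python sketch. The helper name `percent_change` is hypothetical, and the formula (old - new) / old is an assumption about how the scores were derived; it is consistent with the K-means, Bayes and Terasort figures quoted above after rounding.

```python
def percent_change(old_seconds: float, new_seconds: float) -> float:
    """Return the percentage improvement in elapsed time.

    Positive means the newer version is faster, negative means slower.
    Assumed formula: (old - new) / old * 100.
    """
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Elapsed times in seconds quoted in this message: (2.0.2, 2.1.0 RC5)
timings = {
    "K-means (Java API)": (100.0, 78.0),
    "Bayes (slow runs)": (95.0, 140.0),
    "Terasort": (39.0, 32.0),
}

for name, (old, new) in timings.items():
    # Print a signed, rounded percentage, e.g. a leading + for speedups.
    print(f"{name}: {percent_change(old, new):+.0f}%")
```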

For TPC-DS SQL queries the results are a mixed bag again: I see > 10% 
boosts for q9, q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52, 
q57, q89. These are averages over five iterations, with only the Spark 
version changed between runs.
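
The comparison described above (average several iterations per query, then flag changes beyond 10%) could be sketched roughly as follows. This is not the harness used for these measurements: the function name `classify`, the query names and all timings are hypothetical placeholders, not measured results.

```python
from statistics import mean

THRESHOLD_PCT = 10.0  # the "> 10%" cut-off mentioned above


def classify(baseline_runs, candidate_runs, threshold_pct=THRESHOLD_PCT):
    """Compare mean elapsed times and label the change.

    Returns (percent_change, label); a positive percentage means the
    candidate version is faster than the baseline.
    """
    old = mean(baseline_runs)
    new = mean(candidate_runs)
    pct = (old - new) / old * 100.0
    if pct > threshold_pct:
        label = "boost"
    elif pct < -threshold_pct:
        label = "slowdown"
    else:
        label = "within threshold"
    return pct, label


# Placeholder timings in seconds, five iterations each; NOT real results.
example_runs = {
    "query_a": ([10.0, 10.2, 9.8, 10.1, 9.9], [8.0, 8.1, 7.9, 8.2, 7.8]),
    "query_b": ([5.0, 5.1, 4.9, 5.0, 5.0], [6.0, 6.1, 5.9, 6.0, 6.0]),
}

for query, (baseline, candidate) in example_runs.items():
    pct, label = classify(baseline, candidate)
    print(f"{query}: {pct:+.1f}% ({label})")
```

Averaging before comparing keeps a single slow iteration from dominating the result, though with only five runs the spread between iterations is still worth checking.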

From:   Holden Karau <>
To:     Denny Lee <>, Liwei Lin <>, 
"" <>
Date:   18/12/2016 20:05
Subject:        Re: [VOTE] Apache Spark 2.1.0 (RC5)

+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee <> wrote:
+1 (non-binding)

On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin <> wrote:


On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang <> wrote:
I hope this can be fixed before release 2.1.0. It's a fix for the case 
where a broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley <> wrote:

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <> wrote:

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li <> wrote:

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung <>:

For R we have a license field in the DESCRIPTION, and this is standard 
practice (and requirement) for R packages.

From: Sean Owen <>
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin;
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)


(If you have a template for these emails, maybe update it to use https 
links. They work for domains. After all we are asking people to verify the integrity 
of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big 
deal, my OCD eye just noticed it.)

I don't see an Apache license / notice for the Pyspark or SparkR 
artifacts. It would be good practice to include this in a convenience 
binary. I'm not sure if it's strictly mandatory, but something to adjust 
in any event. I think that's all there is to do for SparkR. For Pyspark, 
which packages a bunch of dependencies, it does include the licenses 
(good) but I think it should include the NOTICE

This is the first time I recall getting 0 test failures off the bat!

I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.

I think I'd +1 this therefore unless someone knows that the license issue 
above is real and a blocker.

On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin <> wrote:

Please vote on releasing the following candidate as Apache Spark version 
2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see

The tag to be voted on is v2.1.0-rc5 

List of JIRA tickets resolved are:

The release files, including signatures, digests, etc. can be found at:

Release artifacts are signed with the following key:

The staging repository for this release can be found at:

The documentation corresponding to this release can be found at:


How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions.

What should happen to JIRA tickets still targeting 2.1.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked 
on immediately. Everything else please retarget to 2.1.1 or 2.2.0.

What happened to RC3/RC4?

They had issues with the release packaging and as a result were skipped.

Herman van Hövell
Software Engineer
Databricks Inc.
+31 6 420 590 27

Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
