spark-dev mailing list archives

From Reynold Xin
Subject [VOTE] Release Apache Spark 1.5.0 (RC3)
Date Tue, 01 Sep 2015 20:41:46 GMT
Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
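
As a rough illustration of the passing rule stated above, the check can be sketched in Python. This is only one reading of the rule: it assumes that only PMC votes count, that at least three +1 votes are needed, and that +1 votes must outnumber -1 votes. The function name `vote_passes` and its parameters are hypothetical.

```python
def vote_passes(pmc_plus_ones: int, pmc_minus_ones: int) -> bool:
    """Return True if the vote passes under the assumed reading of the rule.

    Assumption: only PMC votes count, at least three +1 votes are required,
    and +1 votes must form a majority of the PMC votes cast.
    """
    if pmc_plus_ones < 0 or pmc_minus_ones < 0:
        raise ValueError("vote counts must be non-negative")
    return pmc_plus_ones >= 3 and pmc_plus_ones > pmc_minus_ones
```

Under this reading, three +1 votes with no -1 votes pass, while two +1 votes do not, no matter how many non-PMC +1 votes are cast.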

To learn more about Apache Spark, please see

The tag to be voted on is v1.5.0-rc3:

The release files, including signatures, digests, etc. can be found at:

Release artifacts are signed with the following key:

The staging repository for this release (published as 1.5.0-rc3) can be
found at:

The staging repository for this release (published as 1.5.0) can be found
at:

The documentation corresponding to this release can be found at:

How can I help test this release?
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and then
reporting any regressions.
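
One possible way to spot performance regressions is to time the same jobs on 1.4 and on the release candidate and compare the results. The sketch below is only an illustration: the function `find_regressions`, the job names, and the 10% tolerance are all hypothetical and not part of any Spark tooling.

```python
from typing import Dict, List


def find_regressions(
    baseline_secs: Dict[str, float],
    candidate_secs: Dict[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Return jobs that look worse on the candidate than on the baseline.

    A job is flagged when its candidate run time exceeds the baseline by more
    than `tolerance` (a fraction), or when it has no candidate timing at all.
    """
    regressions = []
    for job, base in sorted(baseline_secs.items()):
        cand = candidate_secs.get(job)
        if cand is None:
            # No timing on the candidate (for example, the job failed to run).
            regressions.append(job)
        elif cand > base * (1.0 + tolerance):
            regressions.append(job)
    return regressions
```

For example, with baseline timings `{"etl": 100.0, "join": 50.0}` and candidate timings `{"etl": 105.0, "join": 70.0}`, only `"join"` is flagged, since `"etl"` stays within the assumed 10% tolerance.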

What justifies a -1 vote for this release?
This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.

What should happen to JIRA tickets still targeting 1.5.0?
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.

Major changes to help you focus your testing

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to the Apache JIRA changelog.

RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
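
To make the TimestampType change in the list above concrete, here is a minimal sketch of what 1us precision means for a value that carries nanosecond detail. It is plain Python, not Spark code, and it assumes that digits below one microsecond are simply dropped rather than rounded; the message does not say which. The name `to_micros` is hypothetical.

```python
NANOS_PER_MICRO = 1_000


def to_micros(epoch_nanos: int) -> int:
    """Convert nanoseconds since the epoch to whole microseconds.

    Assumption: digits below one microsecond are dropped (floor), not rounded.
    """
    return epoch_nanos // NANOS_PER_MICRO


# A value with nanosecond detail loses its last three digits.
original_nanos = 1_441_140_106_123_456_789
stored_micros = to_micros(original_nanos)
restored_nanos = stored_micros * NANOS_PER_MICRO
lost_nanos = original_nanos - restored_nanos
```

Workloads that rely on sub-microsecond timestamp digits are worth checking on this release candidate.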

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly
algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore
connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, Discrete Cosine transformation,
MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
regression
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
test
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models,
LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass
classification evaluator.
- Documentation: document the release version of public API methods
