spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject [VOTE] Release Apache Spark 1.5.0 (RC3)
Date Tue, 01 Sep 2015 20:41:46 GMT
Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/


The tag to be voted on is v1.5.0-rc3:
https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (published as 1.5.0-rc3) can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1143/

The staging repository for this release (published as 1.5.0) can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1142/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/


=======================================
How can I help test this release?
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.


================================================
What justifies a -1 vote for this release?
================================================
This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.


===============================================================
What should happen to JIRA tickets still targeting 1.5.0?
===============================================================
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentations will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.


==================================================
Major changes to help you focus your testing
==================================================

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to Apache JIRA changelog.

RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly
algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore
connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, Discrete Cosine transformation,
MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
regression.
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
test.
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models,
LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass
classification evaluator.
- Documentation: document the release version of public API methods

Mime
View raw message