On 2015/12/17 6:32, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!

The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)

The release files, including signatures, digests, etc. can be found at:

Release artifacts are signed with the following key:

The staging repository for this release can be found at:

The test repository (versioned as v1.6.0-rc3) for this release can be found at:

The documentation corresponding to this release can be found at:

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.

== What should happen to JIRA tickets still targeting 1.6.0? ==
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.

== Major changes to help you focus your testing ==

Notable changes since 1.6 RC2

- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed

Notable changes since 1.6 RC1

Spark Streaming

  • SPARK-2629  trackStateByKey has been renamed to mapWithState

Spark SQL

Notable Features Since 1.5

Spark SQL

  • SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
  • SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) even on shared clusters.
  • SPARK-9999  Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and code generation (i.e. Project Tungsten).
  • SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
  • SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
  • SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
  • SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
  • SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
  • SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
  • SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
  • SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
  • SPARK-10978 Datasource API Avoid Double Filter - When implementing a datasource with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
  • SPARK-4849  Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
  • SPARK-9858  Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
  • SPARK-9241  Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
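
For readers unfamiliar with the <=> operator mentioned under SPARK-11111, here is a minimal plain-Python sketch of how null-safe equality differs from ordinary SQL equality. It models SQL NULL as None; the function names are made up for illustration, and this is not Spark code.

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """Ordinary SQL equality (=): the result is NULL if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: Any, b: Any) -> bool:
    """Null-safe equality (<=>): never NULL; two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b
```

With ordinary equality, rows whose join key is NULL never match; with <=> they match other NULL-keyed rows, which is why such joins need their own planning.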

Spark Streaming

  • API Updates
    • SPARK-2629  New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
    • SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
    • SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver, to customize what data is to be stored in memory.
    • SPARK-6328  Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
  • UI Improvements
    • Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
    • Made output operations visible in the streaming tab as progress bars.
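
To make the idea behind stateful stream processing (SPARK-2629) concrete, here is a rough conceptual sketch in plain Python. It is not the mapWithState API: run_batch and running_count are hypothetical names, and the only assumption is the general shape of the pattern, where a user function sees a key, a new value, and that key's previous state, and returns an output plus the updated state.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical signature: (key, value, previous state) -> (output, new state)
MappingFunc = Callable[[str, Any, Optional[Any]], Tuple[Any, Any]]


def run_batch(records: Iterable[Tuple[str, Any]],
              state: Dict[str, Any],
              mapping_func: MappingFunc) -> List[Any]:
    """Apply mapping_func to every record of one batch, updating per-key state."""
    outputs = []
    for key, value in records:
        output, new_state = mapping_func(key, value, state.get(key))
        state[key] = new_state  # state survives into the next batch
        outputs.append(output)
    return outputs


def running_count(key: str, value: int, count: Optional[int]) -> Tuple[Any, int]:
    """Example mapping function: keep a running total per key."""
    total = (count or 0) + value
    return (key, total), total


state: Dict[str, Any] = {}
first = run_batch([("a", 1), ("b", 2), ("a", 3)], state, running_count)
second = run_batch([("a", 10)], state, running_count)
```

After the two batches, first is [("a", 1), ("b", 2), ("a", 4)] and second is [("a", 14)]: the total for key "a" carries over between batches.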


New algorithms/models

  • SPARK-8518  Survival analysis - Log-linear model for survival analysis
  • SPARK-9834  Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
  • SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming framework
  • SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
  • SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering variant of K-Means

API improvements

  • ML Pipelines
    • SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
    • SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
  • R API
    • SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
    • SPARK-9681  Feature interactions in R formula - Interaction operator ":" in R formula
  • Python API - Many improvements to Python API to approach feature parity

Misc improvements

  • SPARK-7685, SPARK-9642  Instance weights for GLMs - Logistic and Linear Regression can take instance weights
  • SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
  • SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source

Documentation improvements

  • SPARK-7751  @since versions - Documentation includes initial version when classes and methods were added
  • SPARK-11337 Testable example code - Automated testing for code in user guide examples


Deprecations

  • In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
  • In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated in favor of the new name "coefficients". This helps disambiguate from instance (row) weights given to algorithms.

Changes of behavior

  • spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
  • spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
  • Spark SQL's partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
  • When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
  • With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
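
Three of the behavior changes above (validationTol, integral-to-timestamp casts, and partition discovery) can be sketched in plain Python as follows. This is only one reading of the descriptions in this email, not Spark's implementation: all function names are hypothetical, the exact stopping formula is an assumption modeled on the "relative for large errors, absolute below 0.01" wording, and timestamps are shown in UTC for simplicity.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Dict


def should_stop_boosting(prev_error: float, curr_error: float,
                         validation_tol: float) -> bool:
    """Assumed 1.6-style check: the improvement is compared against a tolerance
    scaled by the previous error, with a floor of 0.01 so that small errors are
    effectively judged on an absolute threshold."""
    improvement = prev_error - curr_error
    return improvement < validation_tol * max(prev_error, 0.01)


def long_to_timestamp(value: int) -> datetime:
    """1.6 behavior per SPARK-11724: an integral value is seconds since the
    epoch (the 1.5 reading would have been value / 1000)."""
    return datetime.fromtimestamp(value, tz=timezone.utc)


def partition_columns(leaf_file: str, root: str) -> Dict[str, str]:
    """Only key=value directories strictly below root count as partitions.
    root plays the role of the given path, or of basePath when it is set."""
    relative = PurePosixPath(leaf_file).parent.relative_to(root)
    return dict(part.split("=", 1) for part in relative.parts if "=" in part)
```

For example, partition_columns("/my/data/x=1/y=2/part-0", "/my/data/x=1") yields only {"y": "2"}, while passing "/my/data" as the root (the basePath case) yields both x and y. Likewise, long_to_timestamp(1450000000) falls in December 2015, whereas the milliseconds reading of the same number would land in January 1970.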