Please vote on releasing the following candidate as Apache Spark 1.6.0.
The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
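As a rough illustration of the passing rule above, here is a minimal sketch of a tally. It assumes one reading of the rule: only PMC votes are counted, at least three of them must be +1, and +1 votes must outnumber -1 votes. The `Vote` type, the `vote_passes` function, and the voter names are hypothetical and exist only for this example.

```python
from collections import namedtuple

# Hypothetical record for one ballot: value is +1 or -1,
# is_pmc marks whether the voter is a PMC member.
Vote = namedtuple("Vote", ["voter", "value", "is_pmc"])


def vote_passes(votes, min_pmc_plus_ones=3):
    """Return True if the PMC votes satisfy the assumed passing rule."""
    pmc_votes = [v for v in votes if v.is_pmc]
    plus = sum(1 for v in pmc_votes if v.value == +1)
    minus = sum(1 for v in pmc_votes if v.value == -1)
    # Assumed rule: at least three PMC +1 votes, and a majority of +1 over -1.
    return plus >= min_pmc_plus_ones and plus > minus


# Made-up ballots for illustration only.
example_votes = [
    Vote("alice", +1, True),
    Vote("bob", +1, True),
    Vote("carol", +1, True),
    Vote("dave", -1, True),
    Vote("erin", +1, False),  # not a PMC member, so not counted here
]

print(vote_passes(example_votes))
```

Votes from non-PMC members are still useful feedback on the candidate; in this sketch they simply do not enter the count.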
The release files, including signatures, digests, etc. can be
found at:
Release artifacts are signed with the following key:
The staging repository for this release can be found at:
The test repository (versioned as v1.6.0-rc3) for this release
can be found at:
The documentation corresponding to this release can be found at:
== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking an existing Spark workload and running it on this
release candidate, then reporting any regressions.
== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.6 QA period, so
-1 votes should only occur for significant regressions from
1.5. Bugs already present in 1.5, minor regressions, or bugs
related to new features will not block this release.
== What should happen to JIRA tickets still targeting 1.6.0? ==
It is OK for documentation patches to target 1.6.0 and still
go into branch-1.6, since documentation will be published
separately from the release.
New features for non-alpha modules should target 1.7+.
Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop
the target version.
== Major changes to help you focus your testing ==
Changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed
Changes since 1.6 RC1
been renamed to
Notable Features Since 1.5
- SPARK-11787 Parquet Performance -
Improve Parquet scan performance when using flat schemas.
- SPARK-10810 Session Management - Isolated default database
(i.e. USE mydb) even on shared clusters.
- SPARK-9999 Dataset API - A
type-safe API (similar to RDDs) that performs many
operations on serialized binary data and code generation
(i.e. Project Tungsten).
- SPARK-10000 Unified Memory Management -
Shared memory for execution and caching instead of
exclusive division of the regions.
- SPARK-11197 SQL Queries on Files -
Concise syntax for running SQL queries over files of any
supported format without registering a table.
- SPARK-11745 Reading non-standard JSON
files - Added options to read non-standard JSON
files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL
Execution - Display statistics on a per-operator
basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for
StructTypes - Makes it easier to nest and unnest
arbitrary numbers of columns.
- SPARK-10917, SPARK-11149 In-memory Columnar Cache
Performance - Significant (up to 14x) speed up
when caching data that contains complex types in
DataFrames or SQL.
- SPARK-11111 Fast null-safe joins -
Joins using null-safe equality (<=>)
will now execute using SortMergeJoin instead of
computing a Cartesian product.
- SPARK-11389 SQL Execution Using Off-Heap
Memory - Support for configuring query
execution to occur using off-heap memory to avoid GC overhead.
- SPARK-10978 Datasource API Avoid Double
Filter - When implementing a datasource with
filter pushdown, developers can now tell Spark SQL to
avoid double evaluating a pushed-down filter.
- SPARK-4849 Advanced Layout of Cached Data -
Storing partitioning and ordering schemes in the in-memory
table scan, and adding distributeBy and localSort to the DataFrame API.
- SPARK-9858 Adaptive query execution -
Initial support for automatically selecting the number of
reducers for joins and aggregations.
- SPARK-9241 Improved query planner for
queries having distinct aggregations - Query
plans of distinct aggregations are more robust when
distinct columns have high cardinality.
- API Updates
- SPARK-2629 New improved state
a DStream transformation for stateful stream
functionality and performance.
- SPARK-11198 Kinesis record
deaggregation - Kinesis streams have been
upgraded to use KCL 1.4.0 and support transparent
deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler
function - Allows an arbitrary function to be
applied to a Kinesis record in the Kinesis receiver
to customize what data is to be stored in
- SPARK-6328 Python Streaming Listener
API - Get streaming statistics (scheduling
delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in
the streaming tab, in the timelines, batch list, and
batch details page.
- Made output operations
visible in the streaming tab as progress bars.
- SPARK-8518 Survival analysis -
Log-linear model for survival analysis
- SPARK-9834 Normal equation for least
squares - Normal equation solver, providing
R-like model summary statistics
- SPARK-3147 Online hypothesis testing -
A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers -
ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering -
Fast top-down clustering variant of K-Means
- ML Pipelines
- SPARK-6725 Pipeline persistence -
Save/load for ML Pipelines, with partial coverage
- SPARK-5565 LDA in ML Pipelines -
API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs -
(Partial) R-like stats for ordinary least squares
- SPARK-9681 Feature interactions in R
formula - Interaction operator ":" in R
- Python API - Many
improvements to Python API to approach feature parity
- SPARK-7685, SPARK-9642 Instance weights for GLMs -
Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate
statistics in DataFrames - Variance, stddev,
- SPARK-10117 LIBSVM data source -
LIBSVM as a SQL data source
- SPARK-7751 @since versions -
Documentation includes initial version when classes and
methods were added
- SPARK-11337 Testable example code -
Automated testing for code in user guide examples
In spark.mllib.clustering.KMeans, the "runs" parameter has
been deprecated.
In spark.ml.regression.LinearRegressionModel, the "weights"
field has been deprecated in favor of the new name
"coefficients". This helps disambiguate from instance
(row) weights given to algorithms.
Changes of behavior
validationTol has changed semantics in 1.6. Previously,
it was a threshold for absolute change in error. Now, it
resembles the behavior of GradientDescent
convergenceTol: For large errors, it uses relative error
(relative to the previous error); for small errors (<
0.01), it uses absolute error.
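To make the new validationTol semantics concrete, here is a minimal sketch of the rule as described above, next to the old absolute-change check. It is not Spark's implementation: the function names and the exact form of the comparison are assumptions made for illustration, and only the 0.01 threshold comes from the text.

```python
def stops_old(previous_error, current_error, tol):
    """1.5-style check as described: threshold on the absolute change in error."""
    return abs(previous_error - current_error) < tol


def stops_new(previous_error, current_error, tol, small_error=0.01):
    """1.6-style check as described (a sketch, not Spark's code).

    For small errors (< small_error) the change is compared to tol directly;
    otherwise it is compared relative to the previous error.
    """
    change = abs(previous_error - current_error)
    if abs(previous_error) < small_error:
        return change < tol                    # absolute error
    return change < tol * abs(previous_error)  # relative error


# A change of 0.5 on an error of 100 is 0.5% in relative terms.
print(stops_old(100.0, 99.5, 0.01), stops_new(100.0, 99.5, 0.01))
```

With a large error, the same tolerance that never triggered under the absolute check can now stop training early, so workloads that tuned validationTol against 1.5 may need to revisit the value.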
Previously, it did not convert strings to lowercase
before tokenizing. Now, it converts to lowercase by
default, with an option not to. This matches the
behavior of the simpler Tokenizer transformer.
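The lowercase change can be pictured with a small pure-Python stand-in. This is a sketch of the behavior described above, not the Spark transformer itself: `regex_tokenize`, its `pattern` default, and the `to_lowercase` flag are hypothetical names chosen for the example.

```python
import re


def regex_tokenize(text, pattern=r"\s+", to_lowercase=True):
    """Split text on a regex, lowercasing first unless told not to."""
    if to_lowercase:
        text = text.lower()
    # Drop empty strings produced by leading or trailing separators.
    return [token for token in re.split(pattern, text) if token]


print(regex_tokenize("Hello Spark World"))                      # new default
print(regex_tokenize("Hello Spark World", to_lowercase=False))  # 1.5-style output
```

Pipelines that relied on case being preserved (for example, case-sensitive vocabularies) would need to turn the lowercase option off to keep their 1.5 output.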
- Spark SQL's partition
discovery has been changed to only discover partition
directories that are children of the given path (i.e. the
given path itself will no longer be considered a partition,
only its children). This behavior can be overridden by manually
specifying the basePath that partitioning discovery should start with (SPARK-11678).
- When casting a value of an
integral type to timestamp (e.g. casting a long value to
timestamp), the value is treated as being in seconds
instead of milliseconds (SPARK-11724).
- With the improved query
planner for queries having distinct aggregations (SPARK-9241),
the plan of a query having a single distinct aggregation
has been changed to a more robust version. To switch
back to the plan generated by Spark 1.5's planner,