From: Kousuke Saruta <saru...@oss.nttdata.co.jp>
Subject: Re: [VOTE] Release Apache Spark 1.6.0 (RC3)
Date: Thu, 17 Dec 2015 13:21:07 GMT
+1

On 2015/12/17 6:32, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark 
> version 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and 
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7):
> https://github.com/apache/spark/tree/v1.6.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be 
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> =======================================
> == How can I help test this release? ==
> =======================================
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload, running it on this release candidate, and
> reporting any regressions.
>
> ================================================
> == What justifies a -1 vote for this release? ==
> ================================================
> This vote is happening towards the end of the 1.6 QA period, so -1 
> votes should only occur for significant regressions from 1.5. Bugs 
> already present in 1.5, minor regressions, or bugs related to new 
> features will not block this release.
>
> ===============================================================
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===============================================================
> 1. It is OK for documentation patches to target 1.6.0 and still go
> into branch-1.6, since documentation will be published separately
> from the release.
> 2. New features for non-alpha modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the 
> target version.
>
>
> ==================================================
> == Major changes to help you focus your testing ==
> ==================================================
>
>
>   Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
>
>   Notable changes since 1.6 RC1
>
>
>       Spark Streaming
>
>   * SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>     trackStateByKey has been renamed to mapWithState
>
>
>       Spark SQL
>
>   * SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>     SPARK-12189
>     <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in
>     eviction of storage memory by execution.
>   * SPARK-12258
>     <https://issues.apache.org/jira/browse/SPARK-12258> Correct
>     passing null into ScalaUDF
>
>
>     Notable Features Since 1.5
>
>
>       Spark SQL
>
>   * SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>     Parquet Performance - Improve Parquet scan performance when using
>     flat schemas.
>   * SPARK-10810
>     <https://issues.apache.org/jira/browse/SPARK-10810> Session
>     Management - Isolated default database (i.e. USE mydb) even on
>     shared clusters.
>   * SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
>     Dataset API - A type-safe API (similar to RDDs) that performs many
>     operations directly on serialized binary data and uses code
>     generation (i.e. Project Tungsten). See the first sketch after
>     this list.
>   * SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>     Unified Memory Management - Shared memory for execution and
>     caching instead of exclusive division of the regions.
>   * SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
>     SQL Queries on Files - Concise syntax for running SQL queries over
>     files of any supported format without registering a table (see the
>     second sketch after this list).
>   * SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>     Reading non-standard JSON files - Added options to read
>     non-standard JSON files (e.g. single-quotes, unquoted attributes)
>   * SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>     Per-operator Metrics for SQL Execution - Display statistics on a
>     per-operator basis for memory usage and spilled data size.
>   * SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
>     Star (*) expansion for StructTypes - Makes it easier to nest and
>     unnest arbitrary numbers of columns
>   * SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>     SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>     In-memory Columnar Cache Performance - Significant (up to 14x)
>     speed up when caching data that contains complex types in
>     DataFrames or SQL.
>   * SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
>     Fast null-safe joins - Joins using null-safe equality (<=>) will
>     now execute using SortMergeJoin instead of computing a cartesian
>     product (also shown in the second sketch after this list).
>   * SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
>     SQL Execution Using Off-Heap Memory - Support for configuring
>     query execution to occur using off-heap memory to avoid GC overhead
>   * SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>     Datasource API Avoid Double Filter - When implementing a datasource
>     with filter pushdown, developers can now tell Spark SQL to avoid
>     double evaluating a pushed-down filter.
>   * SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
>     Advanced Layout of Cached Data - Storing partitioning and ordering
>     schemes in the in-memory table scan, and adding distributeBy and
>     localSort to the DataFrame API
>   * SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
>     Adaptive query execution - Initial support for automatically
>     selecting the number of reducers for joins and aggregations.
>   * SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
>     Improved query planner for queries having distinct aggregations -
>     Query plans of distinct aggregations are more robust when distinct
>     columns have high cardinality.
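>
> A minimal sketch of the Dataset API above, assuming a 1.6 spark-shell
> (the file name and the Person class are invented for illustration):
>
>     import sqlContext.implicits._
>
>     case class Person(name: String, age: Long)
>
>     // as[Person] turns the untyped DataFrame into a typed Dataset[Person];
>     // the lambda below operates on Person objects while execution still
>     // runs against Tungsten's serialized binary format.
>     val people = sqlContext.read.json("people.json").as[Person]
>     people.filter(_.age >= 21).show()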
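>
> And a second sketch covering SQL queries on files and null-safe joins
> (the path is invented; a and b are assumed registered temp tables):
>
>     // Query a Parquet file directly, without registering a table
>     // (SPARK-11197).
>     sqlContext.sql("SELECT * FROM parquet.`/data/events`").show()
>
>     // Null-safe equality (<=>) now plans a SortMergeJoin instead of a
>     // cartesian product (SPARK-11111).
>     sqlContext.sql("SELECT * FROM a JOIN b ON a.k <=> b.k").explain()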
>
>
>       Spark Streaming
>
>   * API Updates
>       o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>         New improved state management - mapWithState, a DStream
>         transformation for stateful stream processing, supersedes
>         updateStateByKey in functionality and performance (see the
>         sketch at the end of this section).
>       o SPARK-11198
>         <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
>         record deaggregation - Kinesis streams have been upgraded to
>         use KCL 1.4.0 and support transparent deaggregation of
>         KPL-aggregated records.
>       o SPARK-10891
>         <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
>         message handler function - Allows an arbitrary function to be
>         applied to each Kinesis record in the receiver to customize
>         what data is stored in memory.
>       o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
>         Python Streaming Listener API - Get streaming statistics
>         (scheduling delays, batch processing times, etc.) from Python.
>
>   * UI Improvements
>       o Made failures visible in the streaming tab, in the timelines,
>         batch list, and batch details page.
>       o Made output operations visible in the streaming tab as
>         progress bars.
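>
> A rough sketch of mapWithState, assuming wordCounts is an existing
> DStream[(String, Int)] (name invented); it keeps a running count per key:
>
>     import org.apache.spark.streaming.{State, StateSpec}
>
>     val spec = StateSpec.function(
>       (word: String, one: Option[Int], state: State[Int]) => {
>         val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
>         state.update(sum)   // persist the new running count for this key
>         (word, sum)         // emitted into the resulting DStream
>       })
>     val runningCounts = wordCounts.mapWithState(spec)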
>
>
>       MLlib
>
>
>         New algorithms/models
>
>   * SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
>     Survival analysis - Log-linear model for survival analysis
>   * SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
>     Normal equation for least squares - Normal equation solver,
>     providing R-like model summary statistics
>   * SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
>     Online hypothesis testing - A/B testing in the Spark Streaming
>     framework
>   * SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
>     feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>     transformer
>   * SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
>     Bisecting K-Means clustering - Fast top-down clustering variant of
>     K-Means (see the sketch after this list)
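>
> A minimal smoke test of bisecting k-means on invented toy data
> (assumes a spark-shell with the usual sc):
>
>     import org.apache.spark.mllib.clustering.BisectingKMeans
>     import org.apache.spark.mllib.linalg.Vectors
>
>     // Two well-separated groups of 2-D points.
>     val points = sc.parallelize(Seq(
>       Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
>       Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
>     val model = new BisectingKMeans().setK(2).run(points)
>     model.clusterCenters.foreach(println)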
>
>
>         API improvements
>
>   * ML Pipelines
>       o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
>         Pipeline persistence - Save/load for ML Pipelines, with
>         partial coverage of spark.ml algorithms
>       o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
>         LDA in ML Pipelines - API for Latent Dirichlet Allocation in
>         ML Pipelines
>   * R API
>       o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
>         R-like statistics for GLMs - (Partial) R-like stats for
>         ordinary least squares via summary(model)
>       o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
>         Feature interactions in R formula - Interaction operator ":"
>         in R formula
>   * Python API - Many improvements to Python API to approach feature
>     parity
>
>
>         Misc improvements
>
>   * SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>     SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
>     Instance weights for GLMs - Logistic and Linear Regression can
>     take instance weights
>   * SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>     SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>     Univariate and bivariate statistics in DataFrames - Variance,
>     stddev, correlations, etc.
>   * SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
>     LIBSVM data source - LIBSVM as a SQL data source (see the sketch
>     after this list)
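>
> A sketch exercising the LIBSVM data source together with the new
> DataFrame statistics above (the file name is invented):
>
>     import org.apache.spark.sql.functions.{col, stddev, variance}
>
>     // Yields a DataFrame with "label" and "features" columns.
>     val df = sqlContext.read.format("libsvm").load("sample.libsvm")
>     df.select(stddev(col("label")), variance(col("label"))).show()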
>
>
>             Documentation improvements
>
>   * SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
>     @since versions - Documentation includes initial version when
>     classes and methods were added
>   * SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>     Testable example code - Automated testing for code in user guide
>     examples
>
>
>     Deprecations
>
>   * In spark.mllib.clustering.KMeans, the "runs" parameter has been
>     deprecated.
>   * In spark.ml.classification.LogisticRegressionModel and
>     spark.ml.regression.LinearRegressionModel, the "weights" field has
>     been deprecated, in favor of the new name "coefficients." This
>     helps disambiguate from instance (row) weights given to algorithms.
>
>
>     Changes of behavior
>
>   * spark.mllib.tree.GradientBoostedTrees validationTol has changed
>     semantics in 1.6. Previously, it was a threshold for absolute
>     change in error. Now, it resembles the behavior of GradientDescent
>     convergenceTol: For large errors, it uses relative error (relative
>     to the previous error); for small errors (< 0.01), it uses
>     absolute error.
>   * spark.ml.feature.RegexTokenizer: Previously, it did not convert
>     strings to lowercase before tokenizing. Now, it converts to
>     lowercase by default, with an option not to. This matches the
>     behavior of the simpler Tokenizer transformer.
>   * Spark SQL's partition discovery has been changed to only discover
>     partition directories that are children of the given path. (i.e.
>     if path="/my/data/x=1" then x=1 will no longer be considered a
>     partition but only children of x=1.) This behavior can be
>     overridden by manually specifying the basePath that partitioning
>     discovery should start with (SPARK-11678
>     <https://issues.apache.org/jira/browse/SPARK-11678>).
>   * When casting a value of an integral type to timestamp (e.g.
>     casting a long value to timestamp), the value is treated as being
>     in seconds instead of milliseconds (SPARK-11724
>     <https://issues.apache.org/jira/browse/SPARK-11724>; see the
>     sketch after this list).
>   * With the improved query planner for queries having distinct
>     aggregations (SPARK-9241
>     <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>     query having a single distinct aggregation has been changed to a
>     more robust version. To switch back to the plan generated by Spark
>     1.5's planner, please set
>     spark.sql.specializeSingleDistinctAggPlanning to
>     true (SPARK-12077
>     <https://issues.apache.org/jira/browse/SPARK-12077>).
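>
> Two of these behavior changes are easy to verify directly from a 1.6
> shell (paths invented):
>
>     // SPARK-11678: with basePath set, x=1 is still treated as a partition
>     // column even though only /my/data/x=1 is read.
>     sqlContext.read.option("basePath", "/my/data").parquet("/my/data/x=1")
>
>     // SPARK-11724: an integral value now casts to a timestamp as seconds.
>     sqlContext.sql("SELECT CAST(1450357267 AS TIMESTAMP)").show()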
>

