spark-dev mailing list archives

From Tom Graves <>
Subject Re: [VOTE] Release Apache Spark 1.5.0 (RC3)
Date Fri, 04 Sep 2015 14:30:08 GMT
The upper/lower case thing is known. I assume it was decided to be OK and it's going to be
in the release notes, but Reynold or Josh can probably speak to it more.

     On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <> wrote:

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
     mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
       Center And Scale OK
2.5. RDD operations OK
       State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
       Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile, registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
       (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
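
For readers unfamiliar with the itertools-based sweep mentioned in 2.6, a minimal sketch of the idea follows. The grid values and the `evaluate` callback are hypothetical placeholders, not taken from the test run above; in a real run `evaluate` would train a recommendation model and return a validation error.

```python
import itertools


def grid_search(ranks, num_iters, lambdas, evaluate):
    """Return (best_params, best_score) over the full parameter grid.

    `evaluate` is a hypothetical callback: given (rank, num_iter, lmbda)
    it returns an error score where lower is better.
    """
    best_params, best_score = None, float("inf")
    # itertools.product enumerates every (rank, num_iter, lambda) combination.
    for rank, num_iter, lmbda in itertools.product(ranks, num_iters, lambdas):
        score = evaluate(rank, num_iter, lmbda)
        if score < best_score:
            best_params, best_score = (rank, num_iter, lmbda), score
    return best_params, best_score


# Stand-in scoring function (an assumption) so the sketch runs without Spark.
def fake_error(rank, num_iter, lmbda):
    return abs(rank - 12) + abs(num_iter - 20) / 10.0 + abs(lmbda - 0.1)


best, score = grid_search([8, 12], [10, 20], [0.1, 1.0], fake_error)
print(best, score)
```

Passing the scoring step in as a callback keeps the sweep itself independent of Spark, so the same loop can be reused with any model.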
Two Problems:
1. The synthetic column names are lowercase (i.e. now ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’,
now ‘avg(Total)’; previously 'AVG(Total)'). So programs that depend on the case of the
synthetic column names would fail.
2. orders_3.groupBy("Year","Month").sum('Total').show()
     fails with the error ‘Unable to acquire 4194304 bytes of memory’
   orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with the same error
   Is this a known bug?
Cheers
<k/>
P.S: Sorry for the spam, forgot Reply All
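
Regarding problem 1 above, one possible workaround for programs that depend on the case of synthetic column names is to resolve names case-insensitively before using them. The helper below is a hypothetical sketch, not part of Spark; it works on a plain list of names such as the one `DataFrame.columns` returns.

```python
def resolve_column(columns, wanted):
    """Return the actual column name matching `wanted`, ignoring case.

    `columns` is a list of column names. Raises KeyError if nothing
    matches and ValueError if more than one name matches.
    """
    matches = [c for c in columns if c.lower() == wanted.lower()]
    if not matches:
        raise KeyError(wanted)
    if len(matches) > 1:
        raise ValueError("ambiguous column name: %r" % wanted)
    return matches[0]


# Column lists modeled on the names reported above.
cols_old = ["Year", "Month", "SUM(OrderPrice)"]
cols_new = ["Year", "Month", "sum(OrderPrice)"]

print(resolve_column(cols_old, "SUM(OrderPrice)"))
print(resolve_column(cols_new, "SUM(OrderPrice)"))
```

An alternative that avoids the lookup entirely is to give aggregates an explicit alias, so the program never depends on the generated name.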
On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is
open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC
votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because
To learn more about Apache Spark, please see
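
As an illustration of the passing rule stated above, here is a minimal sketch. It assumes one reading of "a majority of at least 3 +1 PMC votes": at least three +1 votes from PMC members, and more +1 than -1 among PMC votes. The function name and parameters are hypothetical.

```python
def vote_passes(pmc_plus_ones, pmc_minus_ones, minimum_plus_ones=3):
    """Return True if the vote passes under the assumed reading of the rule.

    Only PMC votes are counted: there must be at least `minimum_plus_ones`
    +1 votes, and +1 votes must outnumber -1 votes.
    """
    return (pmc_plus_ones >= minimum_plus_ones
            and pmc_plus_ones > pmc_minus_ones)


print(vote_passes(3, 0))
print(vote_passes(2, 0))
```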

The tag to be voted on is v1.5.0-rc3:
The release files, including signatures, digests, etc. can be found at:
Release artifacts are signed with the following key:
The staging repository for this release (published as 1.5.0-rc3) can be found at:
The staging repository for this release (published as 1.5.0) can be found at:
The documentation corresponding to this release can be found at:

=======================================
How can I help test this release?
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload
and running on this release candidate, then reporting any regressions.

================================================
What justifies a -1 vote for this release?
================================================
This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for
significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs
related to new features will not block this release.

===============================================================
What should happen to JIRA tickets still targeting 1.5.0?
===============================================================
1. It is OK for documentation patches to target 1.5.0 and still go into branch-1.5, since
documentation will be packaged separately from the release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target version.

==================================================
Major changes to help you focus your testing
==================================================
As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. I've curated
a list of important changes for 1.5. For the complete list, please refer to Apache JIRA changelog.
RDD/DataFrame/SQL APIs
- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans
Data Sources, Hive, Hadoop, Mesos and Cluster Management
- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata discovery and schema merging, support reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names
R Language
- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like
Streaming
- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI
Machine Learning and Advanced Analytics
- Feature transformers: CountVectorizer, Discrete Cosine transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic regression.
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov test.
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summary for linear and logistic regression.
- Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass classification evaluator.
- Documentation: document the release version of public API methods
