spark-dev mailing list archives

From ches...@alpinenow.com
Subject Re: [VOTE] Release Apache Spark 1.5.0 (RC2)
Date Tue, 01 Sep 2015 14:13:47 GMT
Thanks for the explanation. Since 1.5.0 rc3 is not yet released, I assume it will be cut from
the 1.5 branch. Doesn't that bring in 1.5.1 snapshot code?

The reason I am asking is that I would like to know: if I want to build 1.5.0
myself, which commit should I use?

Sent from my iPad

> On Sep 1, 2015, at 6:57 AM, Sean Owen <sowen@cloudera.com> wrote:
> 
> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> released. In practice I think it's simpler to follow the defaults of
> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> nothing to do with what commits go into 1.5.0; it's an ignorable
> detail of the version in POMs in the source tree, which don't mean
> much anyway as the source tree itself is not a released version.
> 
>> On Tue, Sep 1, 2015 at 2:48 PM,  <chester@alpinenow.com> wrote:
>> Sorry, I still don't follow. I assume the release would be built from 1.5.0 before
>> moving to 1.5.1. Are you saying that 1.5.0 rc3 could be built from the 1.5.1 snapshot
>> during the release? Or would 1.5.0 rc3 be built from the last commit of 1.5.0 (before
>> the change to the 1.5.1 snapshot)?
>> 
>> 
>> 
>> Sent from my iPad
>> 
>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <sowen@cloudera.com> wrote:
>>> 
>>> That's correct for the 1.5 branch, right? This doesn't mean that the
>>> next RC would have this value. You choose the release version during
>>> the release process.
>>> 
>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <chester@alpinenow.com> wrote:
>>>> It seems that the GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT.
>>>> 
>>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>>>> 
>>>> Chester
>>>> 
>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>> 
>>>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>>>> pretty serious. It basically OOMs the driver for reading any files with a
>>>>> large number of partitions. Looks like the patch for that has already been
>>>>> merged.
>>>>> 
>>>>> I'm going to cut rc3 momentarily.
>>>>> 
>>>>> 
>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sandy.ryza@cloudera.com>
>>>>> wrote:
>>>>>> 
>>>>>> +1 (non-binding)
>>>>>> built from source and ran some jobs against YARN
>>>>>> 
>>>>>> -Sandy
>>>>>> 
>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <vaquar.khan@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> +1 (1.5.0 RC2) Compiled on Windows with YARN.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Vaquar khan
>>>>>>> 
>>>>>>> +1 (non-binding, of course)
>>>>>>> 
>>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>>>> 2. Tested pyspark, mllib
>>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>>>> 2.4. KMeans OK
>>>>>>>      Center And Scale OK
>>>>>>> 2.5. RDD operations OK
>>>>>>>     State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>>      Model evaluation/optimization (rank, numIter, lambda) with itertools OK
>>>>>>> 3. Scala - MLlib
>>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>> 3.2. LinearRegressionWithSGD OK
>>>>>>> 3.3. Decision Tree OK
>>>>>>> 3.4. KMeans OK
>>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>> 3.6. saveAsParquetFile OK
>>>>>>> 3.7. Read and verify the 4.3 save (above) - sqlContext.parquetFile,
>>>>>>>      registerTempTable, sql OK
>>>>>>> 3.8. result = sqlContext.sql("SELECT
>>>>>>>      OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>>>>>>      JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>>>> 4.0. Spark SQL from Python OK
>>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>>>>>>> 5.0. Packages
>>>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work, but
>>>>>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>>>> 6.0. DataFrames
>>>>>>> 6.1. cast,dtypes OK
>>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>>>> 6.3. joins,sql,set operations,udf OK
>>>>>>> 
>>>>>>> Cheers
>>>>>>> <k/>
>>>>>>> 
>>>>>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and
>>>>>>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>> 
>>>>>>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>> 
>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The tag to be voted on is v1.5.0-rc2:
>>>>>>>> 
>>>>>>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>>>>>>> 
>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>>>>>>> 
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0-rc2) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>>>>>>> 
>>>>>>>> The staging repository for this release (published as 1.5.0) can be
>>>>>>>> found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>>>>>>> 
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> =======================================
>>>>>>>> How can I help test this release?
>>>>>>>> =======================================
>>>>>>>> If you are a Spark user, you can help us test this release by taking an
>>>>>>>> existing Spark workload and running it on this release candidate, then
>>>>>>>> reporting any regressions.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ================================================
>>>>>>>> What justifies a -1 vote for this release?
>>>>>>>> ================================================
>>>>>>>> This vote is happening towards the end of the 1.5 QA period, so -1
>>>>>>>> votes should only occur for significant regressions from 1.4. Bugs already
>>>>>>>> present in 1.4, minor regressions, or bugs related to new features will not
>>>>>>>> block this release.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ===============================================================
>>>>>>>> What should happen to JIRA tickets still targeting 1.5.0?
>>>>>>>> ===============================================================
>>>>>>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>>>>>>> branch-1.5, since documentation will be packaged separately from the
>>>>>>>> release.
>>>>>>>> 2. New features for non-alpha-modules should target 1.6+.
>>>>>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>>>>>> target version.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ==================================================
>>>>>>>> Major changes to help you focus your testing
>>>>>>>> ==================================================
>>>>>>>> 
>>>>>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>>>>>>> contributors. I've curated a list of important changes for 1.5. For the
>>>>>>>> complete list, please refer to the Apache JIRA changelog.
>>>>>>>> 
>>>>>>>> RDD/DataFrame/SQL APIs
>>>>>>>> 
>>>>>>>> - New UDAF interface
>>>>>>>> - DataFrame hints for broadcast join
>>>>>>>> - expr function for turning a SQL expression into a DataFrame column
>>>>>>>> - Improved support for NaN values
>>>>>>>> - StructType now supports ordering
>>>>>>>> - TimestampType precision is reduced to 1us
>>>>>>>> - 100 new built-in expressions, including date/time, string, math
>>>>>>>> - memory and local disk only checkpointing
>>>>>>>> 
>>>>>>>> DataFrame/SQL Backend Execution
>>>>>>>> 
>>>>>>>> - Code generation on by default
>>>>>>>> - Improved join, aggregation, shuffle, sorting with cache friendly
>>>>>>>> algorithms and external algorithms
>>>>>>>> - Improved window function performance
>>>>>>>> - Better metrics instrumentation and reporting for DF/SQL execution
>>>>>>>> plans
>>>>>>>> 
>>>>>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>>>>>> 
>>>>>>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>>>>>> Standalone)
>>>>>>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>>>>>> allocation, constraints)
>>>>>>>> - Improved YARN support (dynamic allocation with preferred locations)
>>>>>>>> - Improved Hive support (metastore partition pruning, metastore
>>>>>>>> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
>>>>>>>> - Support persisting data in Hive compatible format in metastore
>>>>>>>> - Support data partitioning for JSON data sources
>>>>>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>>>>>> metadata discovery and schema merging, support reading non-standard legacy
>>>>>>>> Parquet files generated by other libraries)
>>>>>>>> - Faster and more robust dynamic partition insert
>>>>>>>> - DataSourceRegister interface for external data sources to specify
>>>>>>>> short names
>>>>>>>> 
>>>>>>>> SparkR
>>>>>>>> 
>>>>>>>> - YARN cluster mode in R
>>>>>>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>>>>>> regularization
>>>>>>>> - Improved error messages
>>>>>>>> - Aliases to make DataFrame functions more R-like
>>>>>>>> 
>>>>>>>> Streaming
>>>>>>>> 
>>>>>>>> - Backpressure for handling bursty input streams.
>>>>>>>> - Improved Python support for streaming sources (Kafka offsets,
>>>>>>>> Kinesis, MQTT, Flume)
>>>>>>>> - Improved Python streaming machine learning algorithms (K-Means,
>>>>>>>> linear regression, logistic regression)
>>>>>>>> - Native reliable Kinesis stream support
>>>>>>>> - Input metadata like Kafka offsets made visible in the batch details
>>>>>>>> UI
>>>>>>>> - Better load balancing and scheduling of receivers across cluster
>>>>>>>> - Include streaming storage in web UI
>>>>>>>> 
>>>>>>>> Machine Learning and Advanced Analytics
>>>>>>>> 
>>>>>>>> - Feature transformers: CountVectorizer, Discrete Cosine
>>>>>>>> transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and
>>>>>>>> VectorSlicer.
>>>>>>>> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
>>>>>>>> regression.
>>>>>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>>>>>> sequential pattern mining, association rule generation, 1-sample
>>>>>>>> Kolmogorov-Smirnov test.
>>>>>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>>>>>>> - More efficient Pregel API implementation for GraphX
>>>>>>>> - Model summary for linear and logistic regression.
>>>>>>>> - Python API: distributed matrices, streaming k-means and linear
>>>>>>>> models, LDA, power iteration clustering, etc.
>>>>>>>> - Tuning and evaluation: train-validation split and multiclass
>>>>>>>> classification evaluator.
>>>>>>>> - Documentation: document the release version of public API methods
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
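
A note on the version numbering discussed in this thread: below is a minimal Python sketch of the convention Sean describes, where the branch POMs move to the next patch-level SNAPSHOT once any release candidate has been cut. It is only a simplified model of that default, not the Maven release plugin's actual code, and the function names are made up for illustration.

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def release_version(snapshot_version: str) -> str:
    """Return the version a release would be published as (suffix removed)."""
    if not snapshot_version.endswith(SNAPSHOT_SUFFIX):
        raise ValueError(f"not a snapshot version: {snapshot_version!r}")
    return snapshot_version[: -len(SNAPSHOT_SUFFIX)]


def next_development_version(released: str) -> str:
    """Bump the last numeric component and re-append the snapshot suffix."""
    parts = released.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + SNAPSHOT_SUFFIX


if __name__ == "__main__":
    rc_version = release_version("1.5.0-SNAPSHOT")
    print(rc_version)                            # version the RC is published as
    print(next_development_version(rc_version))  # version left on the branch
```

Under this model, cutting any 1.5.0 RC leaves branch-1.5 at 1.5.1-SNAPSHOT, while each RC is still published as 1.5.0. The POM version on the branch therefore says nothing about which commits go into the release.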

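On the question of which commit to build from: the vote email identifies a release candidate by its git tag (v1.5.0-rc2 there) and the commit that tag points to, not by the version in the branch POMs. The sketch below assumes an existing checkout of the Spark repository and reuses the mvn invocation quoted in the test report above. The helper name is hypothetical, and it only assembles the commands rather than running them. A later RC would have its own tag, whose name is not given in this thread.

```python
from typing import List


def build_commands(ref: str) -> List[List[str]]:
    """Assemble (but do not run) the commands to build Spark at a given ref.

    `ref` may be a tag such as "v1.5.0-rc2" or a full commit hash.
    """
    return [
        # Pin the working tree to the exact tag or commit that was voted on.
        ["git", "checkout", ref],
        # Build flags copied from the test report quoted in this thread.
        ["mvn", "clean", "package", "-Pyarn", "-Phadoop-2.6", "-DskipTests"],
    ]


if __name__ == "__main__":
    for command in build_commands("v1.5.0-rc2"):
        print(" ".join(command))
```

Passing the commit hash from the vote email instead of the tag name should select the same source tree, since the tag URL in that email resolves to that commit.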

