spark-dev mailing list archives

From Chester Chen <ches...@alpinenow.com>
Subject Re: [VOTE] Release Apache Spark 1.5.0 (RC2)
Date Tue, 01 Sep 2015 14:22:35 GMT
Thanks Sean, that makes it clear.

On Tue, Sep 1, 2015 at 7:17 AM, Sean Owen <sowen@cloudera.com> wrote:

> Any 1.5 RC comes from the latest state of the 1.5 branch at some point
> in time. The next RC will be cut from whatever the latest commit is.
> You can see the tags in git for the specific commits for each RC.
> There's no such thing as "1.5.1 SNAPSHOT" commits, just commits to
> branch 1.5. I would ignore the "SNAPSHOT" version for your purpose.
>
> You can always build from the exact commit that an RC did by looking
> at tags. There is no 1.5.0 yet so you can't build that, but once it's
> released, you would be able to find its tag as well. You can always
> build the latest 1.5.x branch by building from HEAD of that branch.
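Sean's two options (build the exact RC from its tag, or build the latest 1.5.x from the branch HEAD) can be sketched as shell commands. The tag name and Maven flags below are taken from elsewhere in this thread; the profiles and flags may need adjusting for your environment.

```shell
# Build the exact commit an RC was cut from, via its git tag:
git clone https://github.com/apache/spark.git
cd spark
git checkout v1.5.0-rc2                         # the tag under vote
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests

# Or build the latest state of the 1.5.x line from the branch HEAD:
git checkout branch-1.5
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
```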
>
> On Tue, Sep 1, 2015 at 3:13 PM,  <chester@alpinenow.com> wrote:
> > Thanks for the explanation. Since 1.5.0 RC3 is not yet released, I
> > assume it would be cut from the 1.5 branch; doesn't that bring in
> > 1.5.1-SNAPSHOT code?
> >
> > The reason I am asking these questions is that I would like to know:
> > if I want to build 1.5.0 myself, which commit should I use?
> >
> > Sent from my iPad
> >
> >> On Sep 1, 2015, at 6:57 AM, Sean Owen <sowen@cloudera.com> wrote:
> >>
> >> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
> >> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
> >> released. In practice I think it's simpler to follow the defaults of
> >> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
> >> any 1.5.0-rc is released. It doesn't affect later RCs. This has
> >> nothing to do with what commits go into 1.5.0; it's an ignorable
> >> detail of the version in POMs in the source tree, which don't mean
> >> much anyway as the source tree itself is not a released version.
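The Maven release plugin default Sean mentions can be illustrated with the plugin's standard options. This is a sketch of the convention only, not Spark's actual release tooling, which wraps these steps in its own scripts.

```shell
# release:prepare stamps the release version into the POMs, tags it, then
# bumps the branch to the next development (SNAPSHOT) version. So after
# any 1.5.0 RC is cut, branch-1.5's POMs read 1.5.1-SNAPSHOT, regardless
# of which commits later land in 1.5.0 itself.
mvn release:prepare \
    -DreleaseVersion=1.5.0 \
    -Dtag=v1.5.0-rc3 \
    -DdevelopmentVersion=1.5.1-SNAPSHOT
```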
> >>
> >>> On Tue, Sep 1, 2015 at 2:48 PM,  <chester@alpinenow.com> wrote:
> >>> Sorry, I still don't follow. I assume the release would be built from
> >>> 1.5.0 before moving to 1.5.1. Are you saying 1.5.0 RC3 could be built
> >>> from the 1.5.1 snapshot during release? Or would 1.5.0 RC3 be built
> >>> from the last commit of 1.5.0 (before the change to the 1.5.1
> >>> snapshot)?
> >>>
> >>>
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <sowen@cloudera.com> wrote:
> >>>>
> >>>> That's correct for the 1.5 branch, right? This doesn't mean that the
> >>>> next RC would have this value. You choose the release version during
> >>>> the release process.
> >>>>
> >>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <chester@alpinenow.com>
> >>>>> wrote:
> >>>>> It seems that the GitHub branch-1.5 has already changed the version
> >>>>> to 1.5.1-SNAPSHOT.
> >>>>>
> >>>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
> >>>>>
> >>>>> Chester
> >>>>>
> >>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rxin@databricks.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> I'm going to -1 the release myself since the issue @yhuai
> >>>>>> identified is pretty serious. It basically OOMs the driver for
> >>>>>> reading any files with a large number of partitions. Looks like the
> >>>>>> patch for that has already been merged.
> >>>>>>
> >>>>>> I'm going to cut rc3 momentarily.
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sandy.ryza@cloudera.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> +1 (non-binding)
> >>>>>>> built from source and ran some jobs against YARN
> >>>>>>>
> >>>>>>> -Sandy
> >>>>>>>
> >>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <vaquar.khan@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> +1 (1.5.0 RC2). Compiled on Windows with YARN.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Vaquar khan
> >>>>>>>>
> >>>>>>>> +1 (non-binding, of course)
> >>>>>>>>
> >>>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
> >>>>>>>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> >>>>>>>> 2. Tested pyspark, mllib
> >>>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
> >>>>>>>> 2.3. Decision Tree, Naive Bayes OK
> >>>>>>>> 2.4. KMeans OK
> >>>>>>>>      Center And Scale OK
> >>>>>>>> 2.5. RDD operations OK
> >>>>>>>>     State of the Union Texts - MapReduce, Filter, sortByKey (word
> >>>>>>>> count)
> >>>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>>>>>>      Model evaluation/optimization (rank, numIter, lambda) with
> >>>>>>>> itertools OK
> >>>>>>>> 3. Scala - MLlib
> >>>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>>>>>> 3.2. LinearRegressionWithSGD OK
> >>>>>>>> 3.3. Decision Tree OK
> >>>>>>>> 3.4. KMeans OK
> >>>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>>>>>> 3.6. saveAsParquetFile OK
> >>>>>>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> >>>>>>>> registerTempTable, sql OK
> >>>>>>>> 3.8. result = sqlContext.sql("SELECT
> >>>>>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM
> >>>>>>>> Orders INNER JOIN OrderDetails ON Orders.OrderID =
> >>>>>>>> OrderDetails.OrderID") OK
> >>>>>>>> 4.0. Spark SQL from Python OK
> >>>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State =
> >>>>>>>> 'WA'") OK
> >>>>>>>> 5.0. Packages
> >>>>>>>> 5.1. com.databricks.spark.csv - read/write OK
> >>>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t
> >>>>>>>> work. But com.databricks:spark-csv_2.11:1.2.0 worked)
> >>>>>>>> 6.0. DataFrames
> >>>>>>>> 6.1. cast,dtypes OK
> >>>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> >>>>>>>> 6.3. joins,sql,set operations,udf OK
> >>>>>>>>
> >>>>>>>> Cheers
> >>>>>>>> <k/>
> >>>>>>>>
> >>>>>>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <rxin@databricks.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Please vote on releasing the following candidate as Apache Spark
> >>>>>>>>> version 1.5.0. The vote is open until Friday, Aug 29, 2015 at
> >>>>>>>>> 5:00 UTC and passes if a majority of at least 3 +1 PMC votes are
> >>>>>>>>> cast.
> >>>>>>>>>
> >>>>>>>>> [ ] +1 Release this package as Apache Spark 1.5.0
> >>>>>>>>> [ ] -1 Do not release this package because ...
> >>>>>>>>>
> >>>>>>>>> To learn more about Apache Spark, please see
> >>>>>>>>> http://spark.apache.org/
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The tag to be voted on is v1.5.0-rc2:
> >>>>>>>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
> >>>>>>>>>
> >>>>>>>>> The release files, including signatures, digests, etc. can be
> >>>>>>>>> found at:
> >>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
> >>>>>>>>>
> >>>>>>>>> Release artifacts are signed with the following key:
> >>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
> >>>>>>>>>
> >>>>>>>>> The staging repository for this release (published as 1.5.0-rc2)
> >>>>>>>>> can be found at:
> >>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
> >>>>>>>>>
> >>>>>>>>> The staging repository for this release (published as 1.5.0) can
> >>>>>>>>> be found at:
> >>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
> >>>>>>>>>
> >>>>>>>>> The documentation corresponding to this release can be found at:
> >>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> =======================================
> >>>>>>>>> How can I help test this release?
> >>>>>>>>> =======================================
> >>>>>>>>> If you are a Spark user, you can help us test this release by
> >>>>>>>>> taking an existing Spark workload and running it on this release
> >>>>>>>>> candidate, then reporting any regressions.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ================================================
> >>>>>>>>> What justifies a -1 vote for this release?
> >>>>>>>>> ================================================
> >>>>>>>>> This vote is happening towards the end of the 1.5 QA period, so
> >>>>>>>>> -1 votes should only occur for significant regressions from 1.4.
> >>>>>>>>> Bugs already present in 1.4, minor regressions, or bugs related
> >>>>>>>>> to new features will not block this release.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ===============================================================
> >>>>>>>>> What should happen to JIRA tickets still targeting 1.5.0?
> >>>>>>>>> ===============================================================
> >>>>>>>>> 1. It is OK for documentation patches to target 1.5.0 and still
> >>>>>>>>> go into branch-1.5, since documentation will be packaged
> >>>>>>>>> separately from the release.
> >>>>>>>>> 2. New features for non-alpha modules should target 1.6+.
> >>>>>>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop
> >>>>>>>>> the target version.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ==================================================
> >>>>>>>>> Major changes to help you focus your testing
> >>>>>>>>> ==================================================
> >>>>>>>>>
> >>>>>>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
> >>>>>>>>> contributors. I've curated a list of important changes for 1.5.
> >>>>>>>>> For the complete list, please refer to the Apache JIRA changelog.
> >>>>>>>>>
> >>>>>>>>> RDD/DataFrame/SQL APIs
> >>>>>>>>>
> >>>>>>>>> - New UDAF interface
> >>>>>>>>> - DataFrame hints for broadcast join
> >>>>>>>>> - expr function for turning a SQL expression into a DataFrame
> >>>>>>>>> column
> >>>>>>>>> - Improved support for NaN values
> >>>>>>>>> - StructType now supports ordering
> >>>>>>>>> - TimestampType precision is reduced to 1us
> >>>>>>>>> - 100 new built-in expressions, including date/time, string, math
> >>>>>>>>> - memory and local disk only checkpointing
> >>>>>>>>>
> >>>>>>>>> DataFrame/SQL Backend Execution
> >>>>>>>>>
> >>>>>>>>> - Code generation on by default
> >>>>>>>>> - Improved join, aggregation, shuffle, sorting with
> >>>>>>>>> cache-friendly algorithms and external algorithms
> >>>>>>>>> - Improved window function performance
> >>>>>>>>> - Better metrics instrumentation and reporting for DF/SQL
> >>>>>>>>> execution plans
> >>>>>>>>>
> >>>>>>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
> >>>>>>>>>
> >>>>>>>>> - Dynamic allocation support in all resource managers (Mesos,
> >>>>>>>>> YARN, Standalone)
> >>>>>>>>> - Improved Mesos support (framework authentication, roles,
> >>>>>>>>> dynamic allocation, constraints)
> >>>>>>>>> - Improved YARN support (dynamic allocation with preferred
> >>>>>>>>> locations)
> >>>>>>>>> - Improved Hive support (metastore partition pruning, metastore
> >>>>>>>>> connectivity to 0.13 through 1.2, internal Hive upgrade to 1.2)
> >>>>>>>>> - Support persisting data in Hive-compatible format in the
> >>>>>>>>> metastore
> >>>>>>>>> - Support data partitioning for JSON data sources
> >>>>>>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown,
> >>>>>>>>> faster metadata discovery and schema merging, support reading
> >>>>>>>>> non-standard legacy Parquet files generated by other libraries)
> >>>>>>>>> - Faster and more robust dynamic partition insert
> >>>>>>>>> - DataSourceRegister interface for external data sources to
> >>>>>>>>> specify short names
> >>>>>>>>>
> >>>>>>>>> SparkR
> >>>>>>>>>
> >>>>>>>>> - YARN cluster mode in R
> >>>>>>>>> - GLMs with R formula, binomial/Gaussian families, and
> >>>>>>>>> elastic-net regularization
> >>>>>>>>> - Improved error messages
> >>>>>>>>> - Aliases to make DataFrame functions more R-like
> >>>>>>>>>
> >>>>>>>>> Streaming
> >>>>>>>>>
> >>>>>>>>> - Backpressure for handling bursty input streams.
> >>>>>>>>> - Improved Python support for streaming sources (Kafka offsets,
> >>>>>>>>> Kinesis, MQTT, Flume)
> >>>>>>>>> - Improved Python streaming machine learning algorithms
> >>>>>>>>> (K-Means, linear regression, logistic regression)
> >>>>>>>>> - Native reliable Kinesis stream support
> >>>>>>>>> - Input metadata like Kafka offsets made visible in the batch
> >>>>>>>>> details UI
> >>>>>>>>> - Better load balancing and scheduling of receivers across the
> >>>>>>>>> cluster
> >>>>>>>>> - Include streaming storage in web UI
> >>>>>>>>>
> >>>>>>>>> Machine Learning and Advanced Analytics
> >>>>>>>>>
> >>>>>>>>> - Feature transformers: CountVectorizer, Discrete Cosine
> >>>>>>>>> transformation, MinMaxScaler, NGram, PCA, RFormula,
> >>>>>>>>> StopWordsRemover, and VectorSlicer.
> >>>>>>>>> - Estimators under pipeline APIs: naive Bayes, k-means, and
> >>>>>>>>> isotonic regression.
> >>>>>>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
> >>>>>>>>> sequential pattern mining, association rule generation, 1-sample
> >>>>>>>>> Kolmogorov-Smirnov test.
> >>>>>>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
> >>>>>>>>> - More efficient Pregel API implementation for GraphX
> >>>>>>>>> - Model summary for linear and logistic regression.
> >>>>>>>>> - Python API: distributed matrices, streaming k-means and linear
> >>>>>>>>> models, LDA, power iteration clustering, etc.
> >>>>>>>>> - Tuning and evaluation: train-validation split and multiclass
> >>>>>>>>> classification evaluator.
> >>>>>>>>> - Documentation: document the release version of public API
> >>>>>>>>> methods
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
>
