spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Thoughts on Spark 3 release, or a preview release
Date Fri, 13 Sep 2019 20:07:20 GMT
+1 for a preview release.

DSv2 is quite close to being ready. I can only think of a couple issues
that we need to merge, like getting a fix for stats estimation done. I'll
have a better idea once I've caught up from being away for ApacheCon and
I'll add this to the agenda for our next DSv2 sync on Wednesday.

On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> Ur, Sean.
>
> I prefer a full release like 2.0.0-preview.
>
> https://archive.apache.org/dist/spark/spark-2.0.0-preview/
>
> And, thank you, Xingbo!
> Could you take a look at website generation? It seems to be broken on
> `master`.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <jiangxb1987@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I would like to volunteer to be the release manager of Spark 3 preview,
>> thanks!
>>
>> On Fri, Sep 13, 2019 at 11:21 AM Sean Owen <srowen@gmail.com> wrote:
>>
>>> Well, great to hear the unanimous support for a Spark 3 preview
>>> release. Now, I don't know how to make releases myself :) I would
>>> first open it up to our revered release managers: would anyone be
>>> interested in trying to make one? Sounds like it's not too soon to get
>>> what's in master out for evaluation, as there aren't any major
>>> deficiencies left, although there are a number of items to consider for the
>>> final release.
>>>
>>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>>> order to make it possible to test with JDK 11. (We're only on Scala
>>> 2.12 at this point.)
>>>
>>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <rxin@databricks.com> wrote:
>>> >
>>> > +1! Long due for a preview release.
>>> >
>>> >
>>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <holden@pigscanfly.ca> wrote:
>>> >>
>>> >> I like the idea from the PoV of giving folks something to start
>>> >> testing against and exploring so they can raise issues with us earlier in
>>> >> the process and we have more time to make calls around this.
>>> >>
>>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzhuge@apache.org> wrote:
>>> >>>
>>> >>> +1  Like the idea as a user and a DSv2 contributor.
>>> >>>
>>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabhwan@gmail.com> wrote:
>>> >>>>
>>> >>>> +1 (as a contributor) from me to have a preview release of Spark 3, as
>>> >>>> it would help to test the features. When to cut the preview release is
>>> >>>> questionable, as major work should ideally be done before that - if we
>>> >>>> intend to introduce new features before the official release, that should
>>> >>>> work regardless of this, but if we intend to have an opportunity to test
>>> >>>> earlier, ideally it should.
>>> >>>>
>>> >>>> As one of the contributors in the structured streaming area, I'd like to
>>> >>>> add some items for Spark 3.0, both "must be done" and "better to have". For
>>> >>>> "better to have", I picked some new-feature items which committers
>>> >>>> reviewed for a couple of rounds and then dropped without a soft-reject (no
>>> >>>> valid reason to stop). For Spark 2.4 users, the only added feature for
>>> >>>> structured streaming is the Kafka delegation token (given we treat revising
>>> >>>> the Kafka consumer pool as an improvement). I hope we provide some gifts for
>>> >>>> structured streaming users in the Spark 3.0 envelope.
>>> >>>>
>>> >>>> > must be done
>>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
>>> >>>> It's a correctness issue that multiple users have reported, going back
>>> >>>> to Nov. 2018. There's a way to reproduce it consistently, and a patch
>>> >>>> to fix it was submitted in Jan. 2019.
>>> >>>>
>>> >>>> > better to have
>>> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset
>>> >>>> * SPARK-20568 Delete files after processing in structured streaming
>>> >>>>
>>> >>>> There are some more new feature/improvement items in SS, but given
>>> >>>> we're talking about ramping down, the above list might be a realistic one.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jgp@jgp.net> wrote:
>>> >>>>>
>>> >>>>> As a user/non committer, +1
>>> >>>>>
>>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>>> >>>>> against it. I know the final 3.x will probably need another round of
>>> >>>>> testing when it gets out, but less for sure... I know I could check out and
>>> >>>>> compile, but having a “packaged” pre-version is great if it does not take
>>> >>>>> too much of the team's time...
>>> >>>>>
>>> >>>>> jg
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>> >>>>>
>>> >>>>> +1 from me too but I would like to know what other people think too.
>>> >>>>>
>>> >>>>> On Thu, Sep 12, 2019 at 9:07 AM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Thank you, Sean.
>>> >>>>>>
>>> >>>>>> I'm also +1 for the following three.
>>> >>>>>>
>>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>>> >>>>>>
>>> >>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview`
>>> >>>>>> helps it a lot.
>>> >>>>>>
>>> >>>>>> After this discussion, can we have some timeline for `Spark 3.0
>>> >>>>>> Release Window` in our versioning-policy page?
>>> >>>>>>
>>> >>>>>> - https://spark.apache.org/versioning-policy.html
>>> >>>>>>
>>> >>>>>> Bests,
>>> >>>>>> Dongjoon.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heuermh@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro
>>> >>>>>>> compatibility problems resolved, e.g.
>>> >>>>>>>
>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>> >>>>>>>
>>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.
>>> >>>>>>> As far as I know, Parquet has not cut a release based on this new version.
>>> >>>>>>>
>>> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> >>>>>>>
>>> >>>>>>> https://github.com/apache/spark/pull/24851
>>> >>>>>>> https://github.com/apache/spark/pull/24297
>>> >>>>>>>
>>> >>>>>>>    michael
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <srowen@apache.org> wrote:
>>> >>>>>>>
>>> >>>>>>> I'm curious what current feelings are about ramping down towards a
>>> >>>>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> >>>>>>> though in the past we had informally tossed around "back end of 2019".
>>> >>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>> >>>>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>> >>>>>>> due.
>>> >>>>>>>
>>> >>>>>>> What are the few major items that must get done for Spark 3, in your
>>> >>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> >>>>>>> should feel free to update with things that aren't really needed for
>>> >>>>>>> Spark 3; I already triaged some).
>>> >>>>>>>
>>> >>>>>>> For me, it's:
>>> >>>>>>> - DSv2?
>>> >>>>>>> - Finishing touches on the Hive, JDK 11 update
>>> >>>>>>>
>>> >>>>>>> What about considering a preview release earlier, as happened for
>>> >>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>> >>>>>>> even happen ... about now?
>>> >>>>>>>
>>> >>>>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>> >>>>>>> guess is quite early 2020, from here.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
>>> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
>>> >>>>>>> SPARK-28588 Build a SQL reference doc
>>> >>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> >>>>>>> SPARK-28684 Hive module support JDK 11
>>> >>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
>>> >>>>>>> SPARK-28372 Document Spark WEB UI
>>> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>> >>>>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty local relation table properly
>>> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
>>> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
>>> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>> >>>>>>> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
>>> >>>>>>> SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.*
>>> >>>>>>> SPARK-24640 size(null) returns null
>>> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs containing barrier stage
>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints Support in Spark
>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>> >>>>>>>
>>> >>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Name : Jungtaek Lim
>>> >>>> Blog : http://medium.com/@heartsavior
>>> >>>> Twitter : http://twitter.com/heartsavior
>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> John Zhuge
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Twitter: https://twitter.com/holdenkarau
>>> >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> >
>>>

-- 
Ryan Blue
Software Engineer
Netflix
