spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [VOTE] Release Spark 3.1.1 (RC1)
Date Sun, 07 Feb 2021 02:09:17 GMT
Looks like we resolved all standing issues known so far. I will start
another RC next Monday PST.

On Thu, Feb 4, 2021 at 12:03 AM Kent Yao <yaooqinn@qq.com> wrote:

> Sending https://github.com/apache/spark/pull/31460
>
> Based on my research so far, when there is an existing
> *io.file.buffer.size* in hive-site.xml, the hadoopConf finally gets reset
> by that.
> In many real-world cases, when interacting with the Hive catalog through
> Spark SQL, users may just share the hive-site.xml from their Hive jobs and
> copy it to SPARK_HOME/conf without modification. In Spark, when we
> generate Hadoop configurations, we use *spark.buffer.size (65536)* to
> reset *io.file.buffer.size (4096)*. But when we load hive-site.xml, we
> may ignore this behavior and reset *io.file.buffer.size* again according
> to hive-site.xml.
>
> The PR fixes:
> 1. The configuration priority for setting Hadoop and Hive config here is
> not right; the order should be *spark > spark.hive >
> spark.hadoop > hive > hadoop*.
> 2. This breaks the *spark.buffer.size* config's behavior for tuning IO
> performance with HDFS if there is an existing io.file.buffer.size in
> hive-site.xml.
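
The priority order described above can be illustrated with a small toy model. This is only a sketch of layered overrides using plain dictionaries, not Spark's actual configuration code; the function name `merge_by_priority`, the layer names, and the 4096 value placed in the hive-site layer are assumptions made for illustration.

```python
from typing import Dict, List


def merge_by_priority(layers_low_to_high: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge config layers; later (higher-priority) layers override earlier ones."""
    merged: Dict[str, str] = {}
    for layer in layers_low_to_high:
        merged.update(layer)
    return merged


# Hypothetical layers, named after the order quoted in the message:
# spark > spark.hive > spark.hadoop > hive > hadoop
hadoop_defaults = {"io.file.buffer.size": "4096"}
hive_site = {"io.file.buffer.size": "4096"}       # assumed value copied from a Hive deployment
spark_hadoop: Dict[str, str] = {}                 # spark.hadoop.* settings (none here)
spark_hive: Dict[str, str] = {}                   # spark.hive.* settings (none here)
spark_derived = {"io.file.buffer.size": "65536"}  # derived from spark.buffer.size

# Intended order: the Spark-derived value is applied last and wins.
intended = merge_by_priority(
    [hadoop_defaults, hive_site, spark_hadoop, spark_hive, spark_derived]
)

# Order matching the reported symptom: hive-site.xml is applied last and wins.
regressed = merge_by_priority(
    [hadoop_defaults, spark_hadoop, spark_hive, spark_derived, hive_site]
)

print(intended["io.file.buffer.size"], regressed["io.file.buffer.size"])
```

In this model the only difference between the two results is the position of the hive-site layer, which is the ordering problem the message describes.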
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>
>
>
> On 02/3/2021 15:36, Maxim Gekk <maxim.gekk@databricks.com> wrote:
>
> Hi All,
>
> > Also I am investigating a performance regression in some TPC-DS queries
> (q88 for instance) that is caused by a recent commit in 3.1 ...
>
> I have found that the perf regression is caused by the Hadoop config:
> io.file.buffer.size = 4096
> Before the commit
> https://github.com/apache/spark/commit/278f6f45f46ccafc7a31007d51ab9cb720c9cb14,
> we had:
> io.file.buffer.size = 65536
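
To see whether a site file such as hive-site.xml sets this property, one option is to parse it directly. The following is a minimal sketch using only the Python standard library; it assumes the usual Hadoop-style `<configuration>/<property>/<name>/<value>` layout, and the helper name `find_property` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a named property in Hadoop-style config XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative content only; a real file would be read from disk.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>
"""

print(find_property(SAMPLE_HIVE_SITE, "io.file.buffer.size"))
print(find_property(SAMPLE_HIVE_SITE, "some.other.key"))
```

For a real deployment, the text would come from the hive-site.xml under the Spark conf directory instead of the inline sample.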
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Wed, Feb 3, 2021 at 2:37 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Yeah, agreed. I changed it. Thanks for the heads-up, Tom.
>>
>> On Wed, Feb 3, 2021 at 8:31 AM Tom Graves <tgraves_cs@yahoo.com> wrote:
>>
>>> OK, thanks for the update. That is marked as an improvement; if it's a
>>> blocker, can we mark it as such and describe why? I searched JIRA and
>>> didn't see any critical issues or blockers open.
>>>
>>> Tom
>>> On Tuesday, February 2, 2021, 05:12:24 PM CST, Hyukjin Kwon <
>>> gurwls223@gmail.com> wrote:
>>>
>>>
>>> There is one here: https://github.com/apache/spark/pull/31440. Several
>>> issues appear to be getting identified (to confirm that this is an issue in
>>> OSS too) and fixed in parallel.
>>> There have been some unexpected delays here as several more issues were
>>> found. I will try to file and share relevant JIRAs as soon as I can confirm.
>>>
>>> On Wed, Feb 3, 2021 at 2:36 AM Tom Graves <tgraves_cs@yahoo.com> wrote:
>>>
>>> Just curious if we have an update on the next RC? Is there a JIRA for the
>>> TPC-DS issue?
>>>
>>> Thanks,
>>> Tom
>>>
>>> On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon <
>>> gurwls223@gmail.com> wrote:
>>>
>>>
>>> Just to share the current status, most of the known issues were
>>> resolved. Let me know if there are some more.
>>> One thing left is a performance regression in TPCDS being investigated.
>>> Once this is identified (and fixed if it should be), I will cut another RC
>>> right away.
>>> I roughly expect to cut another RC next Monday.
>>>
>>> Thanks guys.
>>>
>>> On Wed, Jan 27, 2021 at 5:26 AM Terry Kim <yuminkim@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Please check if the following regression should be included:
>>> https://github.com/apache/spark/pull/31352
>>>
>>> Thanks,
>>> Terry
>>>
>>> On Tue, Jan 26, 2021 at 7:54 AM Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>>
>>> If we're OK waiting for it, I'd like to get
>>> https://github.com/apache/spark/pull/31298 in as well (it’s not a
>>> regression but it is a bug fix).
>>>
>>> On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon <gurwls223@gmail.com>
>>> wrote:
>>>
>>> It looks like a cool one but it's a pretty big one and affects the plans
>>> considerably ... maybe it's best to avoid adding it into 3.1.1 in
>>> particular during the RC period if this isn't a clear regression that
>>> affects many users.
>>>
>>> On Tue, Jan 26, 2021 at 11:23 PM Peter Toth <peter.toth@gmail.com> wrote:
>>>
>>> Hey,
>>>
>>> Sorry for chiming in a bit late, but I would like to suggest my PR (
>>> https://github.com/apache/spark/pull/28885) for review and inclusion
>>> into 3.1.1.
>>>
>>> Currently, invalid reuse reference nodes appear in many queries, causing
>>> performance issues and incorrect explain plans. Now that
>>> https://github.com/apache/spark/pull/31243 got merged these invalid
>>> references can be easily found in many of our golden files on master:
>>> https://github.com/apache/spark/pull/28885#issuecomment-767530441.
>>> But the issue isn't master (3.2) specific, actually it has been there
>>> since 3.0 when Dynamic Partition Pruning was added.
>>> So it is not a regression from 3.0 to 3.1.1, but in some cases (like
>>> TPCDS q23b) it is causing performance regression from 2.4 to 3.x.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Tue, Jan 26, 2021 at 6:30 AM Hyukjin Kwon <gurwls223@gmail.com>
>>> wrote:
>>>
>>> Guys, I plan to make an RC as soon as we have no visible issues. I have
>>> merged fixes for a few correctness issues. The remaining ones I see are:
>>> - https://github.com/apache/spark/pull/31319, waiting for a review (I
>>> will review it soon too).
>>> - https://github.com/apache/spark/pull/31336
>>> - I know Max is investigating the perf regression, which hopefully
>>> will be fixed soon.
>>>
>>> Are there any more blockers or correctness issues? Please ping me or
>>> speak up here.
>>> I would like to avoid making an RC when there are clearly some issues to
>>> be fixed.
>>> If you're investigating something suspicious, that's fine too. It's
>>> better to make sure we're safe instead of rushing an RC without finishing
>>> the investigation.
>>>
>>> Thanks all.
>>>
>>>
>>> On Fri, Jan 22, 2021 at 6:19 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>
>>> Sure, thanks guys. I'll start another RC after the fixes. Looks like
>>> we're almost there.
>>>
>>> On Fri, 22 Jan 2021, 17:47 Wenchen Fan, <cloud0fan@gmail.com> wrote:
>>>
>>> BTW, there is a correctness bug being fixed at
>>> https://github.com/apache/spark/pull/30788 . It's not a regression, but
>>> the fix is very simple and it would be better to start the next RC after
>>> merging that fix.
>>>
>>> On Fri, Jan 22, 2021 at 3:54 PM Maxim Gekk <maxim.gekk@databricks.com>
>>> wrote:
>>>
>>> Also I am investigating a performance regression in some TPC-DS queries
>>> (q88 for instance) that is caused by a recent commit in 3.1, highly likely
>>> in the period from 19th November, 2020 to 18th December, 2020.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>>
>>> On Fri, Jan 22, 2021 at 10:45 AM Wenchen Fan <cloud0fan@gmail.com>
>>> wrote:
>>>
>>> -1 as I just found a regression in 3.1. A self-join query works well in
>>> 3.0 but fails in 3.1. It's being fixed at
>>> https://github.com/apache/spark/pull/31287
>>>
>>> On Fri, Jan 22, 2021 at 4:34 AM Tom Graves <tgraves_cs@yahoo.com.invalid>
>>> wrote:
>>>
>>> +1
>>>
>>> built from tarball, verified sha and regular CI and tests all pass.
>>>
>>> Tom
>>>
>>> On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon <
>>> gurwls223@gmail.com> wrote:
>>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.1.
>>>
>>> The vote is open until January 22nd 4PM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc1 (commit
>>> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
>>> https://github.com/apache/spark/tree/v3.1.1-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1364
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc1.
>>>
>>> FAQ
>>>
>>> ===================
>>> What happened to 3.1.0?
>>> ===================
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>> it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
>>> for more details.
>>>
>>> =========================
>>> How can I help test this release?
>>> =========================
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz"
>>> and see if anything important breaks.
>>> In Java/Scala, you can add the staging repository to your project's
>>> resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
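
After installing the RC into a virtual env as described above, it can help to confirm that the environment really picked up the RC build before running a workload. The following is a small sketch using only the standard library; the helper name `installed_version` is made up here, and the expected version string "3.1.1" is an assumption taken from the tarball file name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


EXPECTED = "3.1.1"  # assumption: taken from the pyspark-3.1.1.tar.gz file name

found = installed_version("pyspark")
if found is None:
    print("pyspark is not installed in this environment")
elif found == EXPECTED:
    print(f"pyspark {found} is installed, matching the RC under test")
else:
    print(f"pyspark {found} is installed, expected {EXPECTED}")
```

Running this inside the virtual env, rather than with the system interpreter, is what makes the check meaningful.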
>>>
>>> ===========================================
>>> What should happen to JIRA tickets still targeting 3.1.1?
>>> ===========================================
>>>
>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.1.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==================
>>> But my bug isn't fixed?
>>> ==================
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
