spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: Maintenance releases for SPARK-23852?
Date Wed, 18 Apr 2018 05:42:05 GMT
Since it's a backport from master to branch-2.3 for ORC 1.4.3, I made a
backport PR.

https://github.com/apache/spark/pull/21093

Thank you for raising this issue and confirming, Henry and Xiao. :)

Bests,
Dongjoon.


On Tue, Apr 17, 2018 at 12:01 AM, Xiao Li <gatorsmile@gmail.com> wrote:

> Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and
> ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.
>
> Thanks for your efforts! @Henry and @Dongjoon
>
> Xiao
>
> 2018-04-16 14:41 GMT-07:00 Henry Robinson <henry@apache.org>:
>
>> Seems like there aren't any objections. I'll pick this thread back up
>> when a Parquet maintenance release has happened.
>>
>> Henry
>>
>> On 11 April 2018 at 14:00, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>
>>> Great.
>>>
>>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>>
>>> Currently, the patch is merged only into the master branch. ORC 1.4.1 has
>>> the following issue.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-23340
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> Seems like this would make sense... we usually make maintenance
>>>> releases for bug fixes after a month anyway.
>>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <henry@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 11 April 2018 at 12:47, Ryan Blue <rblue@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>>> Spark.
>>>>>>
>>>>>> To be clear though, this only affects Spark when reading data written
>>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>>
>>>>>
>>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>>> compliant with the spec, just an unusual choice).
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <henry@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all -
>>>>>>>
>>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other
>>>>>>> projects I've been involved with, we've released maintenance releases
>>>>>>> for bugs of this severity.
>>>>>>>
>>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>>
>>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of
>>>>>>> Parquet developers, they'd be willing to consider a maintenance
>>>>>>> release, but would probably only bother if we (or another affected
>>>>>>> project) asked them to.
>>>>>>>
>>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>>> maintenance release makes sense.
>>>>>>>
>>>>>>> What does everyone think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Henry
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
