Seems like this would make sense... we usually make maintenance releases for bug fixes after a month anyway.

On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <> wrote:

On 11 April 2018 at 12:47, Ryan Blue <> wrote:
I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of Spark.

To be clear though, this only affects Spark when reading data written by Impala, right? Or does Parquet CPP also produce data like this?

I don't know about parquet-cpp, but yeah, the only implementation I've seen writing the half-completed stats is Impala. (As you know, that's compliant with the spec, just an unusual choice.)

On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <> wrote:
Hi all - 

SPARK-23852 (where a query can silently give wrong results thanks to a predicate pushdown bug in Parquet) is a fairly bad bug. In other projects I've been involved with, we've shipped maintenance releases for bugs of this severity.

Since Spark 2.4.0 is probably a while away, I wanted to see if there was any consensus over whether we should consider (at least) a 2.3.1.

The reason this particular issue is a bit tricky is that the Parquet community haven't yet produced a maintenance release that fixes the underlying bug, but they are in the process of releasing a new minor version, 1.10, which includes a fix. Having spoken to a couple of Parquet developers, they'd be willing to consider a maintenance release, but would probably only bother if we (or another affected project) asked them to. 

My guess is that we wouldn't want to upgrade to a new minor version of Parquet for a Spark maintenance release, so asking for a Parquet maintenance release makes sense. 

What does everyone think?


Ryan Blue
Software Engineer