spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Filter applied on merged Parquet schemas with new column fails.
Date Wed, 28 Oct 2015 02:11:29 GMT
When schema merging (mergeSchema) is enabled together with a predicate filter,
the query fails because Parquet filters are pushed down regardless of the
individual schema of each split (or rather, each file).

Dominic Ricard reported this issue (
https://issues.apache.org/jira/browse/SPARK-11103)

Although the query works when spark.sql.parquet.filterPushdown is set to
false, its default value is true, so this looks like an issue.
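Until it is fixed, the pushdown can be disabled by hand. A minimal workaround sketch, assuming a Spark 1.5.x-era PySpark SQLContext named `sqlContext`, a hypothetical table path `/tmp/table`, and a hypothetical column `new_col` that exists only in some part-files:

```python
# Workaround sketch (hypothetical path and column name; assumes an existing
# Spark 1.5.x SQLContext `sqlContext`). Disabling pushdown makes Spark
# evaluate the filter after the merged schema is applied, instead of inside
# the Parquet reader where old part-files lack the new column.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

df = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/table")
df.filter(df["new_col"] > 0).show()  # `new_col` is absent from older part-files
```

This trades some scan performance for correctness, since row groups can no longer be skipped by the Parquet reader.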

My questions are: is this clearly an issue, and if so, how should it be
handled?


I believe this is an issue, so I made three rough patches for it and tested
them; all of them appear to work.

The first approach looks simplest and, judging from previous fixes such as
https://issues.apache.org/jira/browse/SPARK-11153, seems appropriate.

However, in terms of safety and performance, I would like to confirm which
one is the proper approach before opening a PR.

1. Simply set spark.sql.parquet.filterPushdown to false when mergeSchema is
enabled.

2. If spark.sql.parquet.filterPushdown is true, retrieve the schema of every
part-file (as well as the merged one), check whether each can accept the
given filter, and push the filter down only when all of them can. I think
this is a bit over-engineered.
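The check in approach 2 can be sketched in plain Python. The schema and filter representations below are simplified stand-ins for illustration, not Spark's actual types:

```python
# Simplified model of approach 2: push the filter down only if every
# part-file's schema contains all columns the filter references.
def can_accept(file_schema, filter_columns):
    """A file can 'accept' a filter if it has every referenced column."""
    return set(filter_columns) <= set(file_schema)

def should_push_down(part_file_schemas, filter_columns):
    """Global decision: push down only when all part-files accept the filter."""
    return all(can_accept(s, filter_columns) for s in part_file_schemas)

# Example: column `b` was added later, so the older part-file lacks it.
schemas = [["a"], ["a", "b"]]
should_push_down(schemas, ["a"])  # True: every file has `a`
should_push_down(schemas, ["b"])  # False: one file lacks `b`, so no pushdown
```

The decision is global, so a single old part-file disables pushdown for the whole scan.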

3. If spark.sql.parquet.filterPushdown is true, retrieve the schema of every
part-file (as well as the merged one) and push the filter down only to the
splits (or rather files) that can accept it. This feels hacky, since it ends
up with different configurations for each task in a job.
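Approach 3's per-file decision can likewise be modeled in plain Python (again with simplified stand-ins for schemas and filters); note how the pushdown decision now varies per file, which is what produces the per-task configuration differences:

```python
# Simplified model of approach 3: decide pushdown per part-file, so each
# task in the job may run with a different effective configuration.
def per_file_pushdown(part_file_schemas, filter_columns):
    """Return, for each part-file schema, whether the filter is pushed to it."""
    required = set(filter_columns)
    return [required <= set(schema) for schema in part_file_schemas]

# Example: only the newer part-file contains column `b`.
schemas = [["a"], ["a", "b"]]
per_file_pushdown(schemas, ["b"])  # [False, True]: only the newer file
                                   # receives the pushed-down filter
```

Compared with approach 2, this keeps the pushdown benefit for new files at the cost of heterogeneous task behavior.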
