spark-issues mailing list archives

From "Cheng Lian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
Date Thu, 01 Dec 2016 18:42:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712707#comment-15712707 ]

Cheng Lian commented on SPARK-17213:
------------------------------------

Agreed, we should disable string and binary filter pushdown until PARQUET-686 is fixed.

We turned off Parquet filter pushdown for string and binary columns in 1.6 because of PARQUET-251
(see SPARK-11153). In Spark 2.1, we upgraded to Parquet 1.8.1 to pick up the fix for PARQUET-251,
and then this issue surfaced because of PARQUET-686. I think it also affects Spark 1.5.1 and
earlier versions.

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
>                 Key: SPARK-17213
>                 URL: https://issues.apache.org/jira/browse/SPARK-17213
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Andrew Duffy
>
> Spark orders strings by comparing their UTF-8 byte arrays, treating each byte as an
> unsigned integer. Parquet does not currently respect this ordering. A fix is in progress
> in Parquet (JIRA and PR links below), but until it lands all filters over strings are
> broken, and {{>}} and {{<}} have an actual correctness issue.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
>     > Seq("a", "é").toDF("name").where("name > 'a'").count
>     1
> {code}
> Querying from a parquet dataset:
> {code}
>     > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
>     > spark.read.parquet("/tmp/bad").where("name > 'a'").count
>     0
> {code}
> This happens because Spark sorts the rows as {{[a, é]}}, whereas Parquet compares strings
> as signed byte arrays. Parquet therefore writes 1 row group with statistics
> {{min=é,max=a}}, and the query drops that row group because its recorded max is not
> greater than {{'a'}}.
> Because of the way Parquet pushes down Eq, equality filters do not affect correctness,
> but they force you to read row groups that you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
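
The ordering mismatch described above can be reproduced without Spark or Parquet. The following is a minimal sketch, assuming plain lexicographic byte-array comparison on both sides; {{ByteOrderDemo}}, {{compareSigned}} and {{compareUnsigned}} are hypothetical names used for illustration only and are not Spark or Parquet APIs.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper object for illustration; not part of Spark or Parquet.
object ByteOrderDemo {

  // Lexicographic comparison that treats each byte as signed (-128..127).
  def compareSigned(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = java.lang.Byte.compare(a(i), b(i))
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Lexicographic comparison that treats each byte as unsigned (0..255).
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  def main(args: Array[String]): Unit = {
    val a = "a".getBytes(UTF_8)       // single byte 0x61
    val e = "\u00e9".getBytes(UTF_8)  // "é" encodes as 0xC3 0xA9

    // Unsigned: 0xC3 (195) > 0x61 (97), so "é" sorts after "a".
    println(s"unsigned: é > a is ${compareUnsigned(e, a) > 0}")
    // Signed: 0xC3 is -61, which is < 97, so "é" sorts before "a".
    println(s"signed:   é > a is ${compareSigned(e, a) > 0}")
  }
}
```

Under the signed comparison, "é" is the smaller value, which is consistent with the {{min=é,max=a}} statistics in the description.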



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

