spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <>
Subject [jira] [Commented] (SPARK-26859) Reading ORC files with explicit schema can result in wrong data
Date Tue, 12 Feb 2019 21:43:01 GMT


Dongjoon Hyun commented on SPARK-26859:

Thank you for reporting, [~ivan.vergiliev]. I understand this was marked as `Blocker` since
it returns incorrect data, but the priority should be lower than that since this only happens
with a user-given schema and on the non-vectorized path. Anyway, I'll review it swiftly.

> Reading ORC files with explicit schema can result in wrong data
> ---------------------------------------------------------------
>                 Key: SPARK-26859
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ivan Vergiliev
>            Priority: Major
>              Labels: correctness
> There is a bug in the ORC deserialization code that, when triggered, results in completely
wrong data being read. I've marked this as a Blocker, as per the docs, since it's a data
correctness issue.
> The bug is triggered when all of the following conditions are met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file;
> - the provided schema has columns not present in the ORC file, and these columns are in the middle of the schema;
> - the ORC file being read contains null values in the columns that come after the ones added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets messed up, and, as a result,
> - the null values from the ORC file end up being set on the wrong columns, not the ones they actually belong to, and
> - the old values from the previous record don't get cleared from the columns that should be null.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code:scala}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle that doesn't exist in the dataframe.
> Saving this dataframe to ORC and then reading it back with the specified schema should
result in reading the same values, with nulls for `col4`. Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get properly cleared and ends up
in the third record as well; also, in the last record, instead of the expected `col2 = 9` we
get the null that should have gone to `col3`.
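> For reference, a minimal end-to-end reproduction of the steps above might look like the following sketch. It assumes a `SparkSession` named `spark`, a scratch path `/tmp/orc_repro`, and uses `spark.sql.orc.enableVectorizedReader` to force the non-vectorized reader; these details are illustrative and not part of the original report.
> {code:scala}
> import spark.implicits._
>
> // Write the example DataFrame to ORC.
> val df = spark.sparkContext
>   .parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
>   .toDF("col1", "col2", "col3")
> df.write.mode("overwrite").orc("/tmp/orc_repro")
>
> // Force the non-vectorized ORC reader and read the file back with an
> // explicit schema that inserts `col4` in the middle.
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> spark.read
>   .schema("col1 int, col4 int, col2 int, col3 string")
>   .orc("/tmp/orc_repro")
>   .collect()
>   .foreach(println)
> {code}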
> *Impact*
> When this issue is triggered, it results in completely wrong data being read from
the ORC file. The set of conditions under which it is triggered is somewhat narrow, so the
set of affected users is probably limited. There may also be people who are affected
but haven't realized it because the conditions are so obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with a wrong index in `OrcDeserializer.scala:deserialize()`.
I have a fix that I'll send out for review shortly.
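> For intuition only, the failure mode can be pictured with the simplified, hypothetical sketch below (not the actual `OrcDeserializer` code): a reused row buffer keeps stale values from the previous record, and nulls are written at the file-side column position instead of the position in the requested schema.
> {code:scala}
> // Hypothetical model of the failure mode -- NOT the real OrcDeserializer.
> object WrongIndexSketch {
>   val requestedSchema = Seq("col1", "col4", "col2", "col3") // col4 not in the file
>   val fileSchema      = Seq("col1", "col2", "col3")
>
>   // Reused across records, like Spark's mutable row; stale values survive
>   // unless explicitly overwritten.
>   val row = new Array[Any](requestedSchema.length)
>
>   def deserialize(record: Map[String, Any]): List[Any] = {
>     fileSchema.zipWithIndex.foreach { case (col, fileIdx) =>
>       record(col) match {
>         case null  => row(fileIdx) = null                        // bug: file-side index
>         case value => row(requestedSchema.indexOf(col)) = value  // requested-schema index
>       }
>     }
>     row.toList
>   }
>
>   def main(args: Array[String]): Unit = {
>     val records = Seq(
>       Map("col1" -> 1, "col2" -> 2, "col3" -> "abc"),
>       Map("col1" -> 4, "col2" -> 5, "col3" -> "def"),
>       Map("col1" -> 8, "col2" -> 9, "col3" -> null))
>     // Prints List(1, null, 2, abc), List(4, null, 5, def), List(8, null, null, def),
>     // matching the corrupted output in the example above.
>     records.map(deserialize).foreach(println)
>   }
> }
> {code}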
> *Workaround*
> This bug is currently only triggered when new columns are added to the middle of the
schema. This means that it can be worked around by only adding new columns at the end.
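> For example, reading the same file with the extra column appended at the end instead (an illustrative schema, following the workaround above) avoids the problem:
> {code:scala}
> col1 int, col2 int, col3 string, col4 int
> {code}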
