spark-dev mailing list archives

From Yijie Shen <henry.yijies...@gmail.com>
Subject Parquet File Binary column statistics error when reuse byte[] among rows
Date Sun, 12 Apr 2015 05:50:21 GMT
Hi,

Suppose I create a dataRDD that extends RDD[Row], where each row is a
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
reused across rows but holds different content each time. When I convert
the RDD to a DataFrame and save it as a Parquet file, the file's row group
statistics (max and min) for the Binary column are wrong.



Here is the reason: in Parquet, BinaryStatistics keeps max and min only as
parquet.io.api.Binary references, and Spark SQL generates a new Binary
backed by the same Array[Byte] passed in from the row. The reference chain
behind max looks like this:

  max: Binary ----------> ByteArrayBackedBinary ----------> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max and min
always refer to the same Array[Byte], which holds new content each time.
When Parquet finally writes the statistics to the file, the last row's
content is saved as both max and min.
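
To make the aliasing concrete, here is a minimal sketch in plain Java. It does
not use the real Parquet or Spark classes: StatsAliasingDemo, MinMax, and run
are hypothetical names, and MinMax is only a simplified stand-in for a
statistics holder that keeps references. With copy = false it keeps the
caller's array by reference, as described above; with copy = true it takes a
defensive copy, which is one possible way to avoid the problem and not
necessarily how either project would fix it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a simplified stand-in, NOT the real Parquet statistics code.
final class StatsAliasingDemo {

    /** Tracks min and max either by reference (copy == false) or by defensive copy. */
    static final class MinMax {
        private final boolean copy;
        byte[] min;
        byte[] max;

        MinMax(boolean copy) {
            this.copy = copy;
        }

        void update(byte[] value) {
            if (min == null || compare(value, min) < 0) min = keep(value);
            if (max == null || compare(value, max) > 0) max = keep(value);
        }

        private byte[] keep(byte[] value) {
            return copy ? value.clone() : value;
        }
    }

    /** Lexicographic comparison that treats each byte as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /** Feeds the values "m", "a", "z", "k" through ONE reused buffer. */
    static MinMax run(boolean copy) {
        MinMax stats = new MinMax(copy);
        byte[] shared = new byte[1]; // the buffer reused across "rows"
        for (String s : new String[] {"m", "a", "z", "k"}) {
            shared[0] = s.getBytes(StandardCharsets.US_ASCII)[0];
            stats.update(shared);
        }
        return stats;
    }

    static String text(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        MinMax aliased = run(false);
        MinMax copied = run(true);
        // By reference: both fields point at the shared buffer, so both show the last value.
        System.out.println("aliased: min=" + text(aliased.min) + " max=" + text(aliased.max));
        // By copy: the true extremes survive.
        System.out.println("copied:  min=" + text(copied.min) + " max=" + text(copied.max));
    }
}
```

In the by-reference run, min and max are the same object as the reused buffer,
so every later comparison compares the buffer with itself and both fields end
up reading "k", the last value written. In the copying run the result is
min = "a" and max = "z".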



This looks like a Parquet bug, since it is Parquet's responsibility to
update statistics correctly, but I am not quite sure. Should I report it as
a bug in the Parquet JIRA?


The Spark JIRA is https://issues.apache.org/jira/browse/SPARK-6859
