I submitted a bug report on this issue earlier (SPARK-25145), but it was closed as it could not be reproduced. Furthermore, it was stated that this error is usually caused by a misconfiguration.

Well, we're still having this issue and would very much like to fix our misconfiguration. The gist of it is this:


import pandas as pd

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
df = pd.DataFrame({
    'a': values,
    'b': [v / 2.0 for v in values]})

sdf = spark.createDataFrame(df)

sdf.write.saveAsTable(format='orc', mode='overwrite', name='spark_buffer_size_too_small_on_filter_pushdown', compression='zlib')

spark.conf.set('spark.sql.orc.filterPushdown', True)

spark.sql('SELECT * FROM spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()


Results in a crash:

"java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 220599 at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)"

which is quite annoying.

In short, I need help. But I don't even know which configuration is wrong. Which configs are relevant here to solve this?
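For what it's worth, these are the only ORC-related settings I have been able to find that look relevant. I am guessing at which, if any, actually matter; the setting names are taken from the Spark and ORC documentation:

```python
# Settings that look related to this error; I am guessing at their
# relevance, the names come from the Spark and ORC docs.

# Spark 2.3 ships two ORC readers: 'hive' is the old Hive-1.2.1-based
# one, 'native' uses ORC 1.4. Maybe the reader choice matters here?
spark.conf.set('spark.sql.orc.impl', 'native')

# The setting that triggers the crash for us:
spark.conf.set('spark.sql.orc.filterPushdown', 'true')

# On the write side, the ORC compression chunk size defaults to
# 262144 bytes (the 'size' in the exception). It can be set per write:
sdf.write \
    .option('orc.compress.size', '262144') \
    .saveAsTable('spark_buffer_size_too_small_on_filter_pushdown',
                 format='orc', mode='overwrite')
```

Am I even looking at the right knobs?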


Hive 1.2.1000.
Spark 2.3.3-SNAPSHOT                     
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_192

Could someone please help me out?

Best regards,
Bjørnar Jensen
PhD/Forsker II/Senior researcher
+47 555 84 091
Nygårdsgaten 112, 5008 Bergen, Norway

NORCE Norwegian Research Centre AS