spark-user mailing list archives

From Bjørnar Jensen <b...@norceresearch.no>
Subject [Spark ORC | SQL | Hive] Buffer size too small when using filterPushdown predicate=True (ref.: SPARK-25145)
Date Fri, 23 Nov 2018 10:36:38 GMT
Hi,

I submitted a bug report on this issue earlier (SPARK-25145), but it was closed as it could not be reproduced. Furthermore, it was stated that this error is usually caused by a misconfiguration.

Well, we're still having this issue and would very much like to fix our misconfiguration. The gist of it is:

---

import pandas as pd

# Build a small test frame with an integer and a float column
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
df = pd.DataFrame({
    'a': values,
    'b': [v / 2.0 for v in values]})

sdf = spark.createDataFrame(df)

# Save it as a zlib-compressed ORC table
sdf.write.saveAsTable(name='spark_buffer_size_too_small_on_filter_pushdown',
                      format='orc', mode='overwrite', compression='zlib')

# Enable ORC predicate pushdown, then query with a filter on 'a'
spark.conf.set('spark.sql.orc.filterPushdown', True)

spark.sql('SELECT * FROM spark_buffer_size_too_small_on_filter_pushdown '
          'WHERE a > 5').show()

---

Results in a crash:

"java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 220599
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)"

which is quite annoying.

In short, I need help, but I don't even know which configuration is wrong. Which configs are relevant here to solve this?
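
For reference, these are the knobs I assume are relevant, going by the Spark 2.3 documentation (a sketch only; I have not confirmed that either setting is the right fix):

---

# Fall back to the Hive-based ORC reader/writer instead of the native one
# ('hive' and 'native' are the two accepted values for this config)
spark.conf.set('spark.sql.orc.impl', 'hive')

# Or rule out predicate pushdown as the trigger altogether
spark.conf.set('spark.sql.orc.filterPushdown', False)

spark.sql('SELECT * FROM spark_buffer_size_too_small_on_filter_pushdown '
          'WHERE a > 5').show()

---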

Versions:

Hive 1.2.1000.2.6.5.0-292
Spark 2.3.3-SNAPSHOT
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_192


Could someone please help me out?

Best regards,
Bjørnar Jensen
--
Bjørnar Jensen
PhD/Forsker II/Senior researcher
+47 555 84 091
Nygårdsgaten 112, 5008 Bergen, Norway



NORCE Norwegian Research Centre AS
norceresearch.no <https://www.norceresearch.no/>