From David Christle <dchris...@linkedin.com.INVALID>
Subject Spark 3.0 and ORC 1.6
Date Tue, 28 Jan 2020 20:40:57 GMT
Hi all,

I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard compression option
recently incorporated into ORC 1.6. I would love to explore using it for storing and querying
large (>10 TB) tables in my own disk-I/O-intensive workloads, and other users and companies
may be interested in adopting ZStandard more broadly, since it appears to offer faster
compression at higher compression ratios, with better multi-threaded support, than zlib or Snappy.
At scale, an improvement of even ~10% in disk and/or compute, ideally from nothing more than
setting the “orc.compress” option to a different value, could translate into palpable
capacity/cost gains cluster-wide without requiring broad engineering migrations. See Facebook
Engineering's blog post on the topic for their reported experiences:
https://engineering.fb.com/core-data/zstandard/
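
To make the hoped-for usage concrete, here is a minimal sketch of what I have in mind,
assuming a Spark build that bundles ORC 1.6.x and accepts "zstd" through the existing ORC
compression options (current releases reject it); the paths and column names are illustrative:

    // Sketch only: assumes a Spark build with ORC 1.6.x that accepts "zstd"
    // through the existing ORC options (not the case in today's releases).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-zstd-sketch").getOrCreate()

    // Illustrative data set; any DataFrame would be written the same way.
    val df = spark.range(0, 1000000L).selectExpr("id", "id % 100 AS bucket")

    // Per-write override via the ORC data source options...
    df.write
      .option("compression", "zstd")        // or .option("orc.compress", "ZSTD")
      .orc("/tmp/events_zstd_orc")

    // ...or a session-wide default for all ORC writes.
    spark.conf.set("spark.sql.orc.compression.codec", "zstd")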

Do we know if ORC 1.6.x will make the cut for Spark 3.0?

A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, but I don’t
have a good understanding of how difficult incorporating ORC 1.6.x into Spark will be. For
instance, in the PRs that enabled Java Zstd in ORC (https://github.com/apache/orc/pull/306
and https://github.com/apache/orc/pull/412), there was additional work and discussion around
Hadoop shims to maintain compatibility across Hadoop versions (e.g. 2.7), and around
aircompressor (a library of pure-Java implementations of various compression codecs, so that
a dependency on Hadoop 2.9 is not required). These may turn out to be non-issues, but I wanted
to kindle discussion around whether the upgrade can make the cut for 3.0, since I imagine it’s
a major release many users will focus on migrating to once it ships.
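
As an aside, for anyone who wants to try the codec ahead of Spark support, a small smoke test
against the ORC Java API directly might look like the following (a sketch, assuming orc-core
1.6.x and its storage-api dependency on the classpath; the output path is arbitrary):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.orc.{CompressionKind, OrcFile, TypeDescription}

    object ZstdOrcSmokeTest {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val schema = TypeDescription.fromString("struct<id:bigint,name:string>")
        val path = new Path("/tmp/zstd-smoke.orc")

        // Opening and closing a ZSTD writer exercises the Java codec path
        // added in the ORC PRs above (no native Hadoop zstd needed).
        val writer = OrcFile.createWriter(
          path,
          OrcFile.writerOptions(conf)
            .setSchema(schema)
            .compress(CompressionKind.ZSTD))
        writer.close()

        // Read the footer back to confirm the file reports ZSTD compression.
        val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
        println(s"compression = ${reader.getCompressionKind}")
      }
    }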

Kind regards,
David Christle