spark-user mailing list archives

From Tomas Bartalos <tomas.barta...@gmail.com>
Subject Parquet read performance for different schemas
Date Thu, 19 Sep 2019 16:10:06 GMT
Hello,

I have two Parquet datasets (each containing one file):

   - parquet-wide - schema has 25 top-level cols + 1 array
   - parquet-narrow - schema has 3 top-level cols

Both files have the same data for the given columns.
When I read from parquet-wide, Spark reports *52.6 KB read*; from
parquet-narrow, *only 2.6 KB*.
For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
reading the narrow parquet is much faster.

Since schema pruning is applied, I *expected to get similar results* for
both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference? Is there any
tuning I can do?
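
A minimal sketch of how the pruning could be checked, assuming PySpark and an
existing SparkSession. The helper names, paths and column names are
hypothetical and only for illustration. It captures the physical plan of the
same projection on each dataset and extracts the ReadSchema entry, which should
show the columns Spark requests from the Parquet reader.

```python
import contextlib
import io


def explain_text(df):
    """Capture the text that df.explain() prints (PySpark prints it to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def read_schemas(plan_text):
    """Return whatever follows 'ReadSchema: ' on each scan line of a plan."""
    marker = "ReadSchema: "
    return [line.split(marker, 1)[1].strip()
            for line in plan_text.splitlines() if marker in line]


def compare_reads(spark, paths, columns):
    """Map each path to the ReadSchema entries of the same projection.

    `spark` is an existing SparkSession; `paths` and `columns` are
    placeholders supplied by the caller (assumptions, not from this thread).
    """
    result = {}
    for path in paths:
        df = spark.read.parquet(path).select(*columns)
        result[path] = read_schemas(explain_text(df))
    return result
```

Called as compare_reads(spark, ["/data/parquet-wide", "/data/parquet-narrow"],
["a", "b", "c"]) with placeholder paths and columns, the two entries should
match if the same columns are pruned for both datasets. Long schemas may be
truncated in the plan text.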

Thank you,
Tomas
