spark-user mailing list archives

From Tomas Bartalos <tomas.barta...@gmail.com>
Subject Re: Parquet read performance for different schemas
Date Fri, 20 Sep 2019 09:36:48 GMT
I forgot to mention an important part: I'm issuing the same query to both
parquets, selecting only one column:

df.select(sum('amount))

BR,
Tomas

On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos <tomas.bartalos@gmail.com> wrote:

> Hello,
>
> I have 2 parquets (each containing 1 file):
>
>    - parquet-wide - schema has 25 top level cols + 1 array
>    - parquet-narrow - schema has 3 top level cols
>
> Both files contain the same data for the shared columns.
> When I read from parquet-wide, Spark reports *52.6 KB read*; from
> parquet-narrow, *only 2.6 KB*.
> For a bigger dataset the difference is *961 MB vs 413 MB*. Needless to
> say, reading the narrow parquet is much faster.
>
> Since schema pruning is applied, I *expected similar results* in
> both scenarios (timing and amount of data read).
> What do you think is the reason for such a big difference? Is there any
> tuning I can do?
>
> Thank you,
> Tomas
>
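One knob worth checking is nested schema pruning, since the wide parquet has an array column: in Spark 2.4 the optimizer rule that prunes unused fields of nested/array columns is experimental and disabled by default, so the array column may be decoded even when the query never touches it. A minimal sketch (assuming Spark 2.4+; the paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("parquet-pruning-check")
  // Experimental in Spark 2.4, off by default: prune unused fields
  // inside nested struct/array columns at the Parquet reader level.
  .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
  .getOrCreate()

// Hypothetical path standing in for the "parquet-wide" file.
val df = spark.read.parquet("/path/to/parquet-wide")
df.select(sum("amount")).show()

// Inspect the physical plan: the FileScan node's ReadSchema shows
// which columns are actually requested from the Parquet file.
df.select(sum("amount")).explain()
```

If `ReadSchema` already lists only `amount` in both cases, the remaining gap is more likely row-group layout (column chunk sizes, encodings, dictionary pages) in the wide file than a pruning failure.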
