spark-user mailing list archives

From Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
Subject How does extending an existing parquet with columns affect impala/spark performance?
Date Tue, 03 Apr 2018 14:14:24 GMT
This is not strictly a Spark question, but I'll give it a shot:

I have an existing setup of Parquet files that are queried from both Impala
and Spark.

I intend to add some 30 relatively 'heavy' columns to the Parquet data. Each
column would store an array of structs, each struct having from 5 to 20
fields, and an array may contain a couple of thousand structs.
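To make the shape concrete, here is a minimal sketch of what one such column
might look like as a Spark schema; the field names ("events", "id", "ts", etc.)
are made up for illustration only:

    import org.apache.spark.sql.types._

    // One hypothetical "heavy" column: an array of structs with a few fields.
    val eventStruct = StructType(Seq(
      StructField("id", LongType),
      StructField("ts", TimestampType),
      StructField("score", DoubleType),
      StructField("label", StringType)
    ))

    // The column itself is an array of the struct above.
    val heavyColumn = StructField("events", ArrayType(eventStruct), nullable = true)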

Theoretically, since Parquet is a columnar format, extending it with new
columns should not affect the performance of *existing* queries (they never
touch these columns).
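One way I could sanity-check this premise on the Spark side, I think, is to
look at the physical plan of an existing query against the widened data: the
ReadSchema of the Parquet scan should list only the projected columns, so the
new array-of-struct columns would never be read from disk. The path and column
names below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pruning-check").getOrCreate()

    // Read the widened Parquet data but project only pre-existing columns;
    // explain() should show a ReadSchema limited to those columns.
    val df = spark.read.parquet("/data/events_parquet")
    df.select("existing_col_a", "existing_col_b").explain()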

   - Is this premise correct?
   - What should I watch out for when making this move?
   - In general, what are the considerations when deciding on the "width"
   (i.e. the number of columns) of a Parquet file?
