spark-user mailing list archives

From naresh Goud <nareshgoud.du...@gmail.com>
Subject Re: How does extending an existing parquet with columns affect impala/spark performance?
Date Tue, 03 Apr 2018 14:40:39 GMT
From Spark's point of view, it shouldn't have any effect. You can add columns
to new Parquet files without affecting performance, and no changes to the
Spark application code are required.
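
To illustrate why queries that don't touch the new columns are unaffected,
here is a minimal sketch of a toy columnar layout in plain Python. It is not
Parquet's real on-disk format and does not use Spark; the helper names
(write_columnar, read_columns) and the sample data are hypothetical. It only
shows that reading a fixed set of columns costs the same number of bytes
whether or not extra columns are stored alongside them.

```python
import json


def write_columnar(rows, columns):
    """Store each column as its own serialized chunk (toy stand-in for a column chunk)."""
    return {
        name: json.dumps([row.get(name) for row in rows]).encode("utf-8")
        for name in columns
    }


def read_columns(chunks, wanted):
    """Decode only the requested chunks; return the data and the bytes touched."""
    data = {}
    bytes_read = 0
    for name in wanted:
        chunk = chunks[name]
        bytes_read += len(chunk)
        data[name] = json.loads(chunk.decode("utf-8"))
    return data, bytes_read


# Hypothetical data: two original columns plus one 'heavy' array-of-structs column.
rows = [
    {"id": i, "score": i * 0.5, "events": [{"x": j, "y": j * 2} for j in range(100)]}
    for i in range(50)
]

narrow = write_columnar(rows, ["id", "score"])           # file before the change
wide = write_columnar(rows, ["id", "score", "events"])   # file after adding a column

# The "existing query" selects only the original columns in both cases.
_, narrow_bytes = read_columns(narrow, ["id", "score"])
_, wide_bytes = read_columns(wide, ["id", "score"])

narrow_total = sum(len(c) for c in narrow.values())
wide_total = sum(len(c) for c in wide.values())

print("bytes read, narrow file:", narrow_bytes)
print("bytes read, wide file:  ", wide_bytes)
print("total size grew:", wide_total > narrow_total)
```

The wide "file" is much larger overall, but the existing query reads the same
bytes from it. Real Parquet files have second-order effects this toy ignores,
for example footer metadata that grows with the number of columns, so treat
this as an intuition aid rather than a benchmark.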



On Tue, Apr 3, 2018 at 9:14 AM Vitaliy Pisarev <vitaliy.pisarev@biocatch.com>
wrote:

> This is not strictly a Spark question but I'll give it a shot:
>
> I have an existing setup of Parquet files that are being queried from Impala
> and from Spark.
>
> I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each
> column would store an array of structs. Each struct can have from 5 to 20
> fields. An array may have a couple of thousand structs.
>
> Theoretically, since Parquet is a columnar storage format, extending it with
> columns should not affect the performance of *existing* queries (since they
> are not touching these columns).
>
>    - Is this premise correct?
>    - What should I watch out for when making this move?
>    - In general, what are the considerations when deciding on the "width"
>    (i.e. number of columns) of a Parquet file?
>
>
> --
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/
