spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: AVRO vs Parquet
Date Thu, 10 Mar 2016 18:38:08 GMT
A few clarifications:


> 1) High memory and cpu usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we’ve seen customers
> having trouble making it work. This has made for some very confused ETL
> developers.
>

In Spark 1.6.1 we avoid having more than 2 files open per task, so this
should be less of a problem even for dynamic partitioning.


> 2) Parquet lags well behind Avro in schema evolution semantics. Can only
> add columns at the end? Deleting columns at the end is not recommended if
> you plan to add any columns in the future. Reordering is not supported in
> current release.
>

This may be true for Impala, but Spark SQL does schema merging by name so
you can add / reorder columns with the constraint that you cannot reuse a
name with an incompatible type.
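
To make the merge-by-name rule concrete, here is a toy sketch in plain
Python. It is not Spark's implementation: `merge_schemas` is a hypothetical
helper, schemas are modelled as simple name -> type-name mappings, and any
type difference is treated as incompatible, which is a simplification.

```python
# Toy model of merge-by-name schema merging; NOT Spark's actual code.
# A schema is an ordered mapping of column name -> type name (an assumption
# made for illustration only).

def merge_schemas(left, right):
    """Merge two schemas by column name, ignoring column position."""
    merged = dict(left)  # keep the left schema's column order
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            # Reusing a name with a different type is rejected.
            raise ValueError(
                f"column {name!r} has incompatible types: "
                f"{merged[name]} vs {dtype}"
            )
        merged.setdefault(name, dtype)  # unseen columns are appended
    return merged


# Two hypothetical Parquet file schemas: the newer one reorders the
# columns and adds "score".
old_file = {"id": "long", "name": "string"}
new_file = {"name": "string", "score": "double", "id": "long"}

print(merge_schemas(old_file, new_file))
```

Because columns are matched by name, the reordering in `new_file` is
harmless and `score` is simply added. A file that declared `id` as
`string` would make the merge fail instead.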
