spark-user mailing list archives

From Rishi Shah <rishishah.s...@gmail.com>
Subject [Pyspark 2.4] Large number of row groups in parquet files created using spark
Date Thu, 25 Jul 2019 01:29:10 GMT
Hi All,

I have the following code, which produces a single 600MB parquet file as
expected; however, within this parquet file there are 42 row groups! I would
expect it to create at most 6 row groups. Could someone please shed some
light on this? Is there any config setting I can enable when submitting the
application with spark-submit?

df = spark.read.parquet(INPUT_PATH)
df.coalesce(1).write.parquet(OUT_PATH)

I did try --conf spark.parquet.block.size and spark.dfs.blocksize, but
neither makes any difference.
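For reference, a minimal sketch of the arithmetic behind the "at most 6 row groups" expectation, assuming Parquet's default row-group size of 128 MB (the `parquet.block.size` setting). The spark-submit and writer-option lines in the comments are untested suggestions, not confirmed fixes: the keys tried above are not Spark config keys, whereas `parquet.block.size` is read from the Hadoop configuration, which `spark.hadoop.*` properties and parquet writer options feed into.

```python
import math

# Row-group arithmetic, assuming Parquet's default block size of
# 128 MB: a ~600 MB file would normally hold about 5 row groups.
file_size_mb = 600
block_size_mb = 128  # default parquet.block.size, in MB
expected_row_groups = math.ceil(file_size_mb / block_size_mb)
print(expected_row_groups)  # 5

# Possible ways to set the row-group size (hypothetical values,
# untested in this thread):
#
#   spark-submit --conf spark.hadoop.parquet.block.size=134217728 ...
#
#   df.coalesce(1).write.option("parquet.block.size", 134217728) \
#       .parquet(OUT_PATH)
```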

-- 
Regards,

Rishi Shah
