I am running a SQL query that returns a DataFrame. Then I write the result of the query using “df.write”, but the result gets written as many small files (~100 files of about 200 KB each). So now I am doing a “.coalesce(2)” before the write.
But the number “2” that I picked is static. Is there a way to pick it dynamically depending on the output file size wanted? (Around 256 MB per file would be perfect.)
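To show what I mean, here is a minimal sketch of the kind of calculation I have in mind: given an estimate of the total data size in bytes (which I assume I could get somehow, e.g. from the size of the input files), compute the coalesce count so each output file lands near the 256 MB target. The function name and the way the total size is obtained are just my own illustration, not an existing API.

```python
import math

# Desired size of each output Parquet file (~256 MB).
TARGET_FILE_BYTES = 256 * 1024 * 1024

def partitions_for(total_bytes):
    # Round up so files stay at or under the target size,
    # and always keep at least one partition.
    return max(1, math.ceil(total_bytes / float(TARGET_FILE_BYTES)))
```

Then instead of the hard-coded “.coalesce(2)” I would do something like “.coalesce(partitions_for(estimated_size))”, where “estimated_size” is whatever total-byte estimate I can get for the query result.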
I am running Spark 1.6 on CDH with YARN; the files are written in Parquet format.