I have a job that reads 5,000 small Parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Does Spark group small files? If so, what is the threshold for grouping? Is it configurable? Any link to the corresponding source code?
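For context on what I have found so far: Spark SQL does appear to pack small files into read partitions, driven by `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB). Below is a rough plain-Python sketch of how I understand the packing, loosely modeled on `FilePartition` in the Spark source; it ignores splitting of large files and is only my approximation, not Spark's actual implementation:

```python
MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes default
OPEN_COST = 4 * 1024 * 1024              # spark.sql.files.openCostInBytes default

def num_partitions(file_sizes, default_parallelism):
    """Estimate how many read partitions Spark creates for these files."""
    # Each file is padded by the open cost when computing the total.
    total = sum(size + OPEN_COST for size in file_sizes)
    bytes_per_core = total / default_parallelism
    # Target partition size: capped by maxPartitionBytes, floored by openCost.
    max_split = min(MAX_PARTITION_BYTES, max(OPEN_COST, bytes_per_core))
    partitions, current = 1, 0
    # Files are considered largest-first; a partition is closed once full.
    for size in sorted(file_sizes, reverse=True):
        if current + size > max_split and current > 0:
            partitions += 1
            current = 0
        current += size + OPEN_COST
    return partitions

# e.g. 5,000 files of 1 MB each on a cluster with default parallelism 200:
print(num_partitions([1024 * 1024] * 5000, 200))
```

If this model is right, many tiny files get bin-packed until the padded sizes reach the target split size, which would explain seeing far fewer tasks than files.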