From my understanding, when reading small files Spark groups them and loads the content of each batch into the same partition, so you don't end up with one partition per file and a huge number of very small partitions. This behavior is controlled by the spark.sql.files.maxPartitionBytes parameter, which defaults to 128 MiB. For example, if your file system holds only 8 MiB files, you will end up with partitions holding the content of roughly 16 files each (slightly fewer in practice, because Spark also charges a per-file cost of spark.sql.files.openCostInBytes, 4 MiB by default, when packing files together). Note that the limit applies to the on-disk size: if your files are heavily compressed, a partition can decompress into something pretty fat in memory, roughly spark.sql.files.maxPartitionBytes divided by the compression ratio.
I can't give you a link to a specific source code snippet, but this is my experience from working with a lot of small parquet files.
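Here is a minimal Scala sketch of how you can tune that knob and check the resulting partition count; the bucket path and the 32 MiB value are placeholders, not something from your job:

import org.apache.spark.sql.SparkSession

// Pack up to 32 MiB of file data into each input partition
// (the default is 128 MiB, i.e. 134217728 bytes).
val spark = SparkSession.builder()
  .appName("small-file-grouping")
  .config("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)
  .getOrCreate()

// Placeholder path: point this at your own bucket of small parquet files.
val df = spark.read.parquet("s3a://some-bucket/small-parquet-files/")

// With ~8 MiB files you should see a few files packed into each partition;
// the exact packing also counts spark.sql.files.openCostInBytes
// (4 MiB by default) per file.
println(s"Input partitions: ${df.rdd.getNumPartitions}")

Lowering maxPartitionBytes gives you more, smaller tasks; raising it packs more files into each task.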
From: Yann Moisan [mailto:email@example.com]
Sent: Tuesday, November 13, 2018 21:28
To: firstname.lastname@example.org
Subject: [Spark SQL] Does Spark group small files
I'm using Spark 2.3.1.
I have a job that reads 5,000 small parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any link to the corresponding source code?