spark-user mailing list archives

From Yann Moisan <yam...@gmail.com>
Subject [Spark SQL] Does Spark group small files
Date Tue, 13 Nov 2018 20:28:20 GMT
Hello,

I'm using Spark 2.3.1.

I have a job that reads 5,000 small Parquet files from S3.

When I do a mapPartitions followed by a collect, only *278* tasks are used
(I would have expected 5000). Does Spark group small files? If so, what
is the threshold for grouping? Is it configurable? Is there a link to the
corresponding source code?

Rgds,

Yann.
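
For context on the question above: Spark SQL file sources do pack several small files into one read partition, controlled by spark.sql.files.maxPartitionBytes (default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB). The following is only a rough Python model of that packing, assuming it matches what FileSourceScanExec does in the Spark 2.x line. The function names max_split_bytes and pack_files are made up for illustration and are not Spark APIs.

```python
# Simplified model of Spark SQL's small-file packing (an approximation,
# not Spark's source code). Sizes are in bytes.

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default spark.sql.files.maxPartitionBytes
OPEN_COST_BYTES = 4 * 1024 * 1024        # default spark.sql.files.openCostInBytes


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=MAX_PARTITION_BYTES,
                    open_cost=OPEN_COST_BYTES):
    """Upper bound on the bytes packed into one read partition."""
    # Every file is charged an extra "open cost" on top of its length.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def pack_files(file_sizes, default_parallelism,
               max_partition_bytes=MAX_PARTITION_BYTES,
               open_cost=OPEN_COST_BYTES):
    """Return a list of partitions, each a list of chunk sizes."""
    limit = max_split_bytes(file_sizes, default_parallelism,
                            max_partition_bytes, open_cost)

    # Assumption: files are splittable, so anything larger than the
    # limit is cut into chunks of at most `limit` bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(limit, size - offset))
            offset += limit

    partitions, current, current_size = [], [], 0
    # Largest chunks first, then fill each partition greedily.
    for size in sorted(chunks, reverse=True):
        if current and current_size + size > limit:
            partitions.append(current)
            current, current_size = [], 0
        current_size += size + open_cost
        current.append(size)
    if current:
        partitions.append(current)
    return partitions
```

Calling len(pack_files(sizes, default_parallelism)) with the real file lengths and the cluster's default parallelism gives an estimate of the task count. The result depends on the actual file sizes, which are not given in the message, so this sketch does not claim to reproduce the 278 figure.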
