spark-dev mailing list archives

From Arwin Tio <arwin....@hotmail.com>
Subject DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files
Date Fri, 06 Sep 2019 12:09:20 GMT
Hello,

On Spark 2.4.4, I am using DataFrameReader#csv to read about 300,000 files on S3, and I've
noticed that it takes about an hour for the Driver to load the file listing. You can see the
gap in the timestamps between the StateStoreCoordinator log at 7:45 and the InMemoryFileIndex
log at 8:54:
19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under:
[300K files...]

I believe the bottleneck is in DataSource#checkAndGlobPathIfNecessary [0], specifically where
it calls FileSystem#exists. Unlike bulkListLeafFiles, the existence checks here happen in a
single-threaded flatMap, and each FileSystem#exists call is a blocking network round trip when
the files are stored on S3, so the cost scales linearly with the number of paths.

I believe there is a fairly straightforward opportunity for improvement here: parallelize the
existence checks, perhaps with a configurable number of threads. If that seems reasonable, I
would like to create a JIRA ticket and submit a patch. Please let me know!
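To illustrate the shape of the change, here is a minimal sketch of running the existence
checks on a fixed-size thread pool instead of a sequential flatMap. Everything here is
hypothetical, not Spark code: checkPathsExist and the in-memory stand-in for
FileSystem#exists are made up for the example, and numThreads stands for whatever new
configuration setting would control the parallelism.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelExistenceCheck {
  // Hypothetical stand-in for FileSystem#exists. In Spark this would be a
  // blocking Hadoop FileSystem call (a network round trip on S3).
  def exists(existing: Set[String])(path: String): Boolean =
    existing.contains(path)

  // Check all paths concurrently and return the ones that are missing,
  // mirroring the error reporting in checkAndGlobPathIfNecessary.
  // numThreads models a configurable level of parallelism.
  def checkPathsExist(paths: Seq[String],
                      existing: Set[String],
                      numThreads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Each check runs as its own task on the pool; Future.traverse
      // preserves the input order of `paths` in the result.
      val checks = Future.traverse(paths) { p =>
        Future(p -> exists(existing)(p))
      }
      Await.result(checks, Duration.Inf).collect { case (p, false) => p }
    } finally {
      pool.shutdown()
    }
  }
}
```

With blocking calls that each take a network round trip, this turns O(n) sequential
round trips into roughly O(n / numThreads) wall-clock time, which is where the hour-long
listing delay above should shrink.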

Cheers,

Arwin

[0] https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L557
