spark-user mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject Slow file listing when loading records from S3 without filename or wildcard
Date Fri, 05 Jun 2015 13:17:40 GMT
Hi all,

I'm not sure if this is a Spark issue, or an AWS/Hadoop/S3 driver issue, but I've noticed
that I get a very slow response when I run:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/").count()

(which will count all the files in the directory)

But an almost immediate response if I run this command with a wildcard added to the end:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/*").count()

The time difference is on the order of one extra minute per 1,000 files listed from S3.
Both queries return the same count.
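
For anyone wanting to reproduce the comparison, a minimal timing sketch might look like
the following. The `timed` and `compare` helpers are hypothetical names made up for
illustration (they are not part of Spark), and the sketch assumes it is pasted into
spark-shell, where `sc` is already defined.

```scala
// Hypothetical helper: run a block and return its result plus elapsed seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// Hypothetical helper: time the same count for the bare directory path
// and for the path with a trailing wildcard.
// `countFiles` is whatever performs the listing and count.
def compare(base: String)(countFiles: String => Long): Unit = {
  for (path <- Seq(base, base + "*")) {
    val (n, secs) = timed(countFiles(path))
    println(f"$path%s -> $n%d files in $secs%.1f s")
  }
}

// Usage in spark-shell (where `sc` is predefined):
// compare("s3://emr-test-dgp/testfiles/")(path => sc.wholeTextFiles(path).count())
```

Passing the counting function in as a parameter keeps the timing code separate from
Spark, so the same helper could also time other loaders for comparison.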

This is with thousands of files and no sub-directories to complicate things. Has anyone
seen anything similar?

Thanks,
Ewan
