spark-user mailing list archives

From Tomer Benyamini <tomer....@gmail.com>
Subject S3NativeFileSystem inefficient implementation when calling sc.textFile
Date Wed, 26 Nov 2014 08:06:29 GMT
Hello,

I'm building a Spark app that needs to read a large number of log files from
S3. I do this by constructing the file list in code and passing it to the
context as follows:

val myRDD = sc.textFile("s3n://mybucket/file1, s3n://mybucket/file2, ..., s3n://mybucket/fileN")

Running it locally there are no issues, but when running on the YARN cluster
(Spark 1.1.0, Hadoop 2.4) I see the paths being listed sequentially, one
listStatus call at a time, which could probably be parallelized:


[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file1
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file2
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file3
...
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/fileN


I believe there are some differences between my local classpath and the
cluster's classpath - locally I see that
*org.apache.hadoop.fs.s3native.NativeS3FileSystem* is being used, whereas
on the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is
being used. Any suggestions?
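
To confirm which implementation each environment resolves for the s3n scheme
(I assume the mapping comes from the fs.s3n.impl property in core-site.xml),
I'm checking it like this from the spark-shell:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Prints the concrete FileSystem class bound to s3n:// in this environment
val fs = FileSystem.get(new URI("s3n://mybucket/"), sc.hadoopConfiguration)
println(fs.getClass.getName)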


Thanks,

Tomer
