spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Date Wed, 26 Nov 2014 17:23:09 GMT
Spark has a known problem where it performs a serial metadata pass over a large
number of small files in order to find the partition information prior to
starting the job. This will probably not be fixed by switching the FS
implementation.

However, you can change the FS being used like so (prior to the first
usage):
sc.hadoopConfiguration.set("fs.s3n.impl",
"org.apache.hadoop.fs.s3native.NativeS3FileSystem")
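For completeness, here is a minimal sketch combining this FS setting with the
credential configuration and the newAPIHadoopFile approach discussed in the
quoted replies below. The bucket and path are placeholders, and
fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are the standard Hadoop
property names for s3n credentials; adapt to your setup:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Set credentials and the FS implementation before the first S3 access.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  // placeholder
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Either read directly...
val lines = sc.textFile("s3n://my-bucket/some/path/*")  // my-bucket is a placeholder

// ...or go through newAPIHadoopFile with an explicit config and
// key/value classes, then keep just the line text.
val rdd = sc.newAPIHadoopFile(
  "s3n://my-bucket/some/path/*",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  sc.hadoopConfiguration)
val text = rdd.map(_._2.toString)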

On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <tomer.ben@gmail.com>
wrote:

> Thanks Lalit; Setting the access + secret keys in the configuration works
> even when calling sc.textFile. Is there a way to select which hadoop s3
> native filesystem implementation would be used at runtime using the hadoop
> configuration?
>
> Thanks,
> Tomer
>
> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <lalit@sigmoidanalytics.com>
> wrote:
>
>>
>> You can try creating a hadoop Configuration and setting the s3
>> configuration on it, i.e. access keys etc.
>> Then, for reading files from s3, use newAPIHadoopFile and pass the config
>> object along with the key and value classes.
>>
>> -----
>> Lalit Yadav
>> lalit@sigmoidanalytics.com
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
