spark-user mailing list archives

From Tomer Benyamini <tomer....@gmail.com>
Subject Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Date Sat, 29 Nov 2014 21:16:38 GMT
Thanks - this is very helpful!

On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust <michael@databricks.com>
wrote:

> In the past I have worked around this problem by avoiding sc.textFile().
> Instead I read the data directly inside of a Spark job.  Basically, you
> start with an RDD where each entry is a file in S3 and then flatMap that
> with something that reads the files and returns the lines.
>
> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>
> Using this class you can do something like:
>
> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file2" ... ::
> Nil).flatMap(new ReadLinesSafe(_))
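>
> (Not from the gist — just a minimal sketch of what a ReadLinesSafe-style
> helper might look like, inferred from the usage above: it would need to
> behave as an Iterable of lines so it composes with flatMap. The real
> ReadLinesSafe in the gist may differ.)
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // Hypothetical sketch: read one S3 object via the Hadoop FileSystem
> // API and expose its lines as an Iterable[String].
> class ReadLinesSafe(path: String) extends Iterable[String] with Serializable {
>   def iterator: Iterator[String] = {
>     val p = new Path(path)
>     val fs = p.getFileSystem(new Configuration())
>     val in = fs.open(p)
>     // Materialize the lines so the stream can be closed eagerly.
>     try scala.io.Source.fromInputStream(in).getLines().toVector.iterator
>     finally in.close()
>   }
> }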
>
> You can also build up the list of files by running a Spark job:
> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
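>
> (Again a hedged sketch rather than the gist itself: parallelize a set of
> prefixes — the prefix names here are placeholders — and list each one on
> the executors with the Hadoop FileSystem API. A fresh Configuration is
> created inside the closure because the SparkContext is not serializable.)
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> val prefixes = Seq("s3n://mybucket/day=01/", "s3n://mybucket/day=02/")
> val files = sc.parallelize(prefixes).flatMap { prefix =>
>   val p = new Path(prefix)
>   val fs = p.getFileSystem(new Configuration())  // created on the executor
>   fs.listStatus(p).map(_.getPath.toString)
> }.collect().toSeq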
>
> Michael
>
> On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson <ilikerps@gmail.com>
> wrote:
>
>> Spark has a known problem where it serially performs a metadata pass
>> over a large number of small files, in order to find the partition
>> information prior to starting the job. This will probably not be fixed
>> by switching the FS impl.
>>
>> However, you can change the FS being used like so (prior to the first
>> usage):
>> sc.hadoopConfiguration.set("fs.s3n.impl",
>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>
>> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <tomer.ben@gmail.com>
>> wrote:
>>
>>> Thanks Lalit; setting the access + secret keys in the configuration
>>> works even when calling sc.textFile. Is there a way to select, via the
>>> Hadoop configuration, which Hadoop S3 native filesystem implementation
>>> is used at runtime?
>>>
>>> Thanks,
>>> Tomer
>>>
>>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <lalit@sigmoidanalytics.com>
>>> wrote:
>>>
>>>>
>>>> You can try creating a Hadoop Configuration and setting the S3
>>>> configuration on it, i.e. the access keys etc.
>>>> Then, to read files from S3, use newAPIHadoopFile and pass the config
>>>> object along with the key and value classes.
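>>>>
>>>> A rough sketch of that (bucket path and credential values are
>>>> placeholders):
>>>>
>>>> import org.apache.hadoop.conf.Configuration
>>>> import org.apache.hadoop.io.{LongWritable, Text}
>>>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>>>>
>>>> // A Configuration carrying the S3 credentials, passed to
>>>> // newAPIHadoopFile along with the input format and key/value classes.
>>>> val conf = new Configuration(sc.hadoopConfiguration)
>>>> conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
>>>> conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
>>>>
>>>> val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
>>>>   "s3n://mybucket/path", classOf[TextInputFormat],
>>>>   classOf[LongWritable], classOf[Text], conf
>>>> ).map(_._2.toString)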
>>>>
>>>> -----
>>>> Lalit Yadav
>>>> lalit@sigmoidanalytics.com
>>>
>>
>
