spark-user mailing list archives

From Han JU <ju.han.fe...@gmail.com>
Subject Re: How to read a multipart s3 file?
Date Wed, 07 May 2014 08:00:56 GMT
Just a few additions to the other answers:

If you output to, say, `s3://bucket/myfile`, then you can use that path as
the input of other jobs (`sc.textFile('s3://bucket/myfile')`). By default
all `part-xxx` files under it will be read. There's also
`sc.wholeTextFiles`, which you can experiment with.
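
To make that concrete, here is a minimal PySpark sketch. The helper names
(`read_all_parts`, `read_parts_with_names`) are made up for illustration, and
`sc` is assumed to be an existing SparkContext with S3 credentials already
configured:

```python
def read_all_parts(sc, output_path):
    """Return one RDD of lines covering every part file under output_path."""
    # Passing the output directory itself (not a single part file) makes
    # Spark read all the part-xxx files written by the previous job.
    return sc.textFile(output_path)


def read_parts_with_names(sc, output_path):
    """Return an RDD of (file path, whole file content) pairs."""
    # wholeTextFiles keeps track of which part file each record came from,
    # at the cost of loading each file as a single string.
    return sc.wholeTextFiles(output_path)
```

With a live context you would call something like
`read_all_parts(sc, 's3://bucket/myfile').count()`.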

If your files are small and need to be readable by other tools/languages,
s3n may be a better choice. But in my experience, when reading directly from
s3n, Spark creates only 1 input partition per file, regardless of the file
size. This may lead to performance problems if you have big files.
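
One possible workaround for the single-partition issue is to repartition
right after reading. This is only a rough sketch: `read_and_spread` and
`target_partitions` are hypothetical names, `sc` is assumed to be an existing
SparkContext, and the repartition costs a shuffle:

```python
def read_and_spread(sc, path, target_partitions):
    """Read path and make sure the RDD has at least target_partitions."""
    rdd = sc.textFile(path)
    # If the input produced too few partitions (e.g. one per file),
    # redistribute the records so later stages can run in parallel.
    if rdd.getNumPartitions() < target_partitions:
        rdd = rdd.repartition(target_partitions)
    return rdd
```

For example, `read_and_spread(sc, 's3n://bucket/myfile', 64)` would be a
starting point; the right partition count depends on your cluster and data.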


2014-05-07 2:39 GMT+02:00 Andre Kuhnen <andrekuhnen@gmail.com>:

> Try using s3n instead of s3
> On 06/05/2014 21:19, "kamatsuoka" <kenjim@gmail.com> wrote:
>
>> I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt.
>>
>> Behind the scenes, the S3 driver creates a bunch of files like
>> s3://mybucket/mydir/myfile.txt/part-0000, as well as the block files like
>> s3://mybucket/block_3574186879395643429.
>>
>> How do I construct a URL to use this file as input to another Spark app?
>> I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of
>> them work.
>>
>


-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
