spark-user mailing list archives

From Han JU <>
Subject Re: How to read a multipart s3 file?
Date Wed, 07 May 2014 08:00:56 GMT
Just a few points to complement the other answers:

If you output to, say, `s3://bucket/myfile`, you can use that path as the
input to other jobs (sc.textFile('s3://bucket/myfile')). By default all the
`part-xxx` files under it will be read. There's also `sc.wholeTextFiles`
that you can play with.
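
For instance, a minimal PySpark sketch (the path reuses the example above;
the app name is made up for illustration):

    # Reading a multipart output directory back in.
    from pyspark import SparkContext

    sc = SparkContext(appName="read-multipart-example")  # app name is illustrative

    # textFile on the directory picks up every part-xxx file as one RDD of lines.
    lines = sc.textFile("s3://bucket/myfile")

    # wholeTextFiles instead yields one (filename, content) pair per part file,
    # handy when you need to know which part a record came from.
    pairs = sc.wholeTextFiles("s3://bucket/myfile")

    print(lines.count())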

If your file is small and needs to be readable by other tools/languages, s3n
may be a better choice. But in my experience, when reading directly from
s3n, Spark creates only one input partition per file, regardless of the file
size. This can cause performance problems if you have big files.
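
A common workaround is to repartition right after reading; a sketch (the
path and the target partition count here are made up):

    # Spread a single-partition s3n read across the cluster.
    rdd = sc.textFile("s3n://bucket/big-file.txt")
    print(rdd.getNumPartitions())  # often 1 for a single file read via s3n

    # Repartition (this triggers a shuffle) before any heavy transformation.
    rdd = rdd.repartition(64)

textFile also takes a minPartitions argument, but given the behavior above
it may not help with s3n, so an explicit repartition is the safer bet.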

2014-05-07 2:39 GMT+02:00 Andre Kuhnen <>:

> Try using s3n instead of s3
> On 06/05/2014 21:19, "kamatsuoka" <> wrote:
>> I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt.
>> Behind the scenes, the S3 driver creates a bunch of files like
>> s3://mybucket/mydir/myfile.txt/part-0000, as well as block files like
>> s3://mybucket/block_3574186879395643429.
>> How do I construct a URL to use this file as input to another Spark app?
>> I tried all the variations of s3://mybucket/mydir/myfile.txt, but none of
>> them work.

*JU Han*

Data Engineer @

+33 0619608888
