spark-user mailing list archives

From Thomas Demoor <>
Subject Re: SaveAsTextFile to S3 bucket
Date Tue, 27 Jan 2015 10:15:19 GMT
S3 does not have the concept of a "directory". An S3 bucket only holds files
(objects). The Hadoop filesystem is mapped onto a bucket and uses
Hadoop-specific (or rather "s3 tool"-specific: s3n uses the jets3t library)
conventions (hacks) to fake directories, such as a key ending with a slash
("filename/") or, with s3n, "filename_$folder$" (these are leaky
abstractions; google that if you ever have some spare time :p). S3 simply
doesn't (and shouldn't) know about these conventions. Again, a bucket just
holds a very large number of files. This might seem inconvenient, but
directories are a really bad idea for scalable storage. However, "folder-like"
permissions can be set through IAM.
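For example, a hypothetical policy sketch (bucket and prefix names taken from this thread; not a verified working policy) that grants "folder-like" write access under dev/ plus the bucket-level list permission that a HEAD/list on the prefix needs, which is the permission whose absence produces a 403 like the one below:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::nexgen-software/dev/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::nexgen-software",
      "Condition": { "StringLike": { "s3:prefix": ["dev/*"] } }
    }
  ]
}
```

Note that s3:ListBucket must be granted on the bucket ARN itself (with a prefix condition), not on dev/* — object-level permissions alone are why writing worked only after opening up the whole bucket.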

Summarizing: by setting permissions on /dev you set permissions on that
object. It has no effect on the file "/dev/output", which is, as far as S3
cares, just another object that happens to share part of its object name with /dev.
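The flat keyspace can be sketched in plain Python (no AWS calls; the keys below are illustrative, modelled on the paths in this thread):

```python
# Sketch: an S3 bucket is a flat map from key (a plain string) to object bytes.
# "Directories" are only a naming convention built on the '/' character.
bucket = {
    "dev_$folder$": b"",               # s3n's marker object faking a "dev" directory
    "dev/output/part-00000": b"data",  # a Spark output part file
}

# No object named "dev/output" has to exist for the part file to exist:
assert "dev/output" not in bucket
assert "dev/output/part-00000" in bucket

def list_prefix(bucket, prefix, delimiter="/"):
    """Fake a directory listing by grouping keys on prefix + delimiter,
    the way tools layer listings on top of the flat keyspace."""
    children = set()
    for key in bucket:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            children.add(rest.split(delimiter, 1)[0])
    return sorted(children)

print(list_prefix(bucket, "dev/"))  # ['output']
```

Setting metadata (or permissions) on the key "dev" touches only that one entry in the map; every key under "dev/" is an independent object.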

Thomas Demoor
skype: demoor.thomas
mobile: +32 497883833

On Tue, Jan 27, 2015 at 6:33 AM, Chen, Kevin <> wrote:

>  When Spark saves an RDD to a text file, the directory must not exist
> upfront. It will create the directory and write the data to part-00000 under
> that directory. In my use case, I created a directory dev in the bucket
> nexgen-software. I expected it to create output directly under dev and a
> part-00000 under output. But it gave me an exception, as I had only given write
> permission to dev, not the bucket. If I open up write permission on the bucket,
> it works. But it did not create the output directory under dev; rather, it
> creates another dev/output directory under the bucket. I just want to know if
> it is possible to have the output directory created under the dev directory I
> created upfront.
>   From: Nick Pentreath <>
> Date: Monday, January 26, 2015 9:15 PM
> To: "" <>
> Subject: Re: SaveAsTextFile to S3 bucket
>   Your output path specifies
>  rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>  so it will try to write to /dev/output, which is as expected. If you
> create the directory /dev/output upfront in your bucket, and try to save
> to that (empty) directory, what is the behaviour?
> On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin <>
> wrote:
>>  Does anyone know if I can save an RDD as a text file to a pre-created
>> directory in an S3 bucket?
>>  I have a directory created in the S3 bucket: //nexgen-software/dev
>>  When I tried to save an RDD as a text file in this directory:
>> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>>  I got the following exception at runtime:
>> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
>> org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
>> ResponseCode=403, ResponseMessage=Forbidden
>>  I have verified /dev has write permission. However, if I grant the
>> bucket //nexgen-software write permission, I don't get the exception. But the
>> output is not created under dev. Rather, a different /dev/output directory
>> is created in the bucket (//nexgen-software). Is this how
>> saveAsTextFile behaves on S3? Is there any way I can have output created
>> under a pre-defined directory?
>>  Thanks in advance.
