spark-user mailing list archives

From Thomas Demoor <thomas.dem...@amplidata.com>
Subject Re: SaveAsTextFile to S3 bucket
Date Tue, 27 Jan 2015 10:15:19 GMT
S3 has no concept of a "directory". An S3 bucket only holds files
(objects). The Hadoop filesystem is mapped onto a bucket and uses
Hadoop-specific (or rather "s3tool"-specific: s3n uses the jets3t tool)
conventions (hacks) to fake directories, such as an object whose name ends
with a slash ("filename/") or, with s3n, "filename_$folder$" (these are leaky
abstractions, google that if you ever have some spare time :p). S3 simply
doesn't (and shouldn't) know about these conventions. Again, a bucket just
holds a shitload of files. This might seem inconvenient, but directories are
a really bad idea for scalable storage. However, setting "folder-like"
permissions can be done through IAM:
http://docs.aws.amazon.com/AmazonS3/latest/dev/example-policies-s3.html#iam-policy-ex1
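
To make the flat namespace concrete, here is a minimal sketch that models a bucket as a plain sorted map from full object name to content. FlatBucket and its methods are hypothetical names for illustration only; this is not the S3, jets3t, or Hadoop API. It shows that "listing a directory" is nothing more than filtering object names by a shared prefix, and that no object named "dev/output" has to exist for names below it to exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a bucket: a flat, sorted map from full object name to content.
// Illustration only; not the S3 or Hadoop API.
class FlatBucket {
    private final TreeMap<String, String> objects = new TreeMap<>();

    // Store an object under its full name. No parent "directory" is required.
    void put(String key, String content) {
        objects.put(key, content);
    }

    // True only if an object with exactly this name was stored.
    boolean exists(String key) {
        return objects.containsKey(key);
    }

    // "Listing a directory" is just filtering names by a shared prefix.
    List<String> listPrefix(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : objects.keySet()) {
            if (key.startsWith(prefix)) {
                keys.add(key);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        FlatBucket bucket = new FlatBucket();
        bucket.put("dev_$folder$", "");               // s3n-style directory marker
        bucket.put("dev/output/part-00000", "data");  // object written by a job

        // Only the data object shares the "dev/" prefix; the marker does not.
        System.out.println(bucket.listPrefix("dev/"));
        // There is no object named "dev/output", only a name that contains it.
        System.out.println(bucket.exists("dev/output"));
    }
}
```

The marker object and the data object are unrelated entries in the map; the "folder" exists only in the eye of the tool that interprets the names.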

Summarizing: by setting permissions on /dev you set permissions on that
object only. It has no effect on the file "/dev/output", which is, as far as
S3 cares, another object that happens to share part of its object name with
/dev.
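
The difference between a permission on one object and a "folder-like" permission can be sketched with a toy access check. WriteGrants, grantObject, and grantPrefix are hypothetical names, and this is not how S3 or IAM actually evaluates requests; the prefix grant is only loosely analogous to an IAM policy whose resource is something like arn:aws:s3:::nexgen-software/dev/*.

```java
import java.util.HashSet;
import java.util.Set;

// Toy access check. NOT the real S3/IAM evaluation logic.
class WriteGrants {
    private final Set<String> exactKeys = new HashSet<>();
    private final Set<String> prefixes = new HashSet<>();

    // Grant write access to one object name only.
    void grantObject(String key) {
        exactKeys.add(key);
    }

    // Grant write access to every object name that starts with the prefix.
    void grantPrefix(String prefix) {
        prefixes.add(prefix);
    }

    boolean canWrite(String key) {
        if (exactKeys.contains(key)) {
            return true;
        }
        for (String prefix : prefixes) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        WriteGrants grants = new WriteGrants();

        // A permission on the object "dev" covers that one name only.
        grants.grantObject("dev");
        System.out.println(grants.canWrite("dev/output/part-00000"));

        // A prefix-style grant covers everything "under" dev/.
        grants.grantPrefix("dev/");
        System.out.println(grants.canWrite("dev/output/part-00000"));
    }
}
```

In this model the first check fails and the second succeeds, which mirrors the point above: a grant on /dev says nothing about /dev/output unless it is expressed over the shared part of the name.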

Thomas Demoor
skype: demoor.thomas
mobile: +32 497883833

On Tue, Jan 27, 2015 at 6:33 AM, Chen, Kevin <Kevin.Chen@neustar.biz> wrote:

>  When Spark saves an RDD to a text file, the directory must not exist
> upfront. It will create a directory and write the data to part-0000 under
> that directory. In my use case, I created a directory dev in the bucket:
> //nexgen-software/dev. I expected it to create output directly under dev and
> a part-0000 under output. But it gave me an exception, as I only gave write
> permission to dev, not the bucket. If I open up write permission to the
> bucket, it works. But it does not create the output directory under dev;
> rather, it creates another dev/output directory under the bucket. I just want
> to know if it is possible to have the output directory created under the dev
> directory I created upfront.
>
>   From: Nick Pentreath <nick.pentreath@gmail.com>
> Date: Monday, January 26, 2015 9:15 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: SaveAsTextFile to S3 bucket
>
>   Your output folder specifies
>
>  rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>
>  So it will try to write to /dev/output which is as expected. If you
> create the directory /dev/output upfront in your bucket, and try to save it
> to that (empty) directory, what is the behaviour?
>
> On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin <Kevin.Chen@neustar.biz>
> wrote:
>
>>  Does anyone know if I can save an RDD as a text file to a pre-created
>> directory in an S3 bucket?
>>
>>  I have a directory created in S3 bucket: //nexgen-software/dev
>>
>>  When I tried to save a RDD as text file in this directory:
>> rdd.saveAsTextFile("s3n://nexgen-software/dev/output");
>>
>>
>>  I got following exception at runtime:
>>
>> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
>> org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' -
>> ResponseCode=403, ResponseMessage=Forbidden
>>
>>
>>  I have verified /dev has write permission. However, if I grant the
>> bucket //nexgen-software write permission, I don't get the exception. But
>> the output is not created under dev. Rather, a different /dev/output
>> directory is created directly in the bucket (//nexgen-software). Is this how
>> saveAsTextFile behaves in S3? Is there any way I can have output created
>> under a pre-defined directory?
>>
>>
>>  Thanks in advance.
>>
>>
>>
>>
>>
>
