hadoop-common-issues mailing list archives

From "Amandeep Khurana (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9454) Support multipart uploads for s3native
Date Wed, 26 Feb 2014 20:11:29 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913436#comment-13913436 ]

Amandeep Khurana commented on HADOOP-9454:
------------------------------------------

bq. Would it not be better to replace the Jets3t implementation with one backed by AWS's own
SDK? S3 vs S3N is confusing enough for folks, IMHO better to not add additional choices into
the mix.

Yes, absolutely. If [~aloisius] submits a patch, we should commit it, make it available as an
option in parallel to the current s3n option, and deprecate the s3 and s3n implementations in
due course.

I just reviewed this patch and it looks good to me. The only concern is that it does not parallelize
the movement of large files using MR, so multiple mappers can't upload different parts of a large
file. Also, I don't know for a fact that it's possible to split a file into multiple parts
and have individual mappers do the uploads with the current implementation; even if it is,
it would require significant changes to this patch.

Having said that, I think this patch can be put in and we can open another jira for enhancements.
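
For context on the parallelism point: S3's native multipart API, which the AWS SDK exposes
directly (relevant to the suggestion above of replacing jets3t with the AWS SDK), does let
independent workers upload parts against a shared upload ID, with a single completion call at
the end. Below is a minimal sketch using the AWS SDK for Java; the bucket, key, and local file
are hypothetical, and this only illustrates the underlying protocol, not what the current
patch does:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
import com.amazonaws.services.s3.model.UploadPartResult;

public class MultipartSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(); // credentials resolved from the environment
        String bucket = "my-bucket";        // hypothetical
        String key = "big/object";          // hypothetical

        // One coordinator initiates the upload and would share the upload ID.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        File file = new File("/tmp/big-file");  // hypothetical local source
        long partSize = 64L * 1024 * 1024;      // 64 MB, matching the patch's default
        List<PartETag> etags = new ArrayList<PartETag>();

        // Each part upload is independent of the others: in principle,
        // different workers (e.g. mappers) could each run one iteration.
        int partNumber = 1;
        for (long offset = 0; offset < file.length(); offset += partSize) {
            long size = Math.min(partSize, file.length() - offset);
            UploadPartResult result = s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(key)
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber++)
                    .withFile(file).withFileOffset(offset)
                    .withPartSize(size));
            etags.add(result.getPartETag());
        }

        // A single completion call stitches the parts into one object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, key, uploadId, etags));
    }
}
{code}

The hard part for an MR-based approach would be distributing the upload ID to the mappers and
collecting the part ETags back for the final completion call.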

> Support multipart uploads for s3native
> --------------------------------------
>
>                 Key: HADOOP-9454
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9454
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Jordan Mendelson
>            Assignee: Akira AJISAKA
>         Attachments: HADOOP-9454-10.patch, HADOOP-9454-11.patch, HADOOP-9454-12.patch
>
>
> The s3native filesystem is limited to 5 GB file uploads to S3; however, the newest version
> of jets3t supports multipart uploads, which allow storing multi-TB files. While the s3 filesystem
> lets you bypass this restriction by uploading blocks, we need to output our data into Amazon's
> publicdatasets bucket, which is shared with others.
> Amazon has added a similar feature to their distribution of Hadoop, as has MapR.
> Please note that while this supports large copies, it does not yet support parallel copies:
> unlike with uploads, jets3t does not yet expose an API that would allow parallel copies without
> Hadoop controlling the threads itself.
> By default, this patch does not enable multipart uploads. To enable them and parallel
> uploads, add the following keys to your Hadoop config:
> <property>
>   <name>fs.s3n.multipart.uploads.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>fs.s3n.multipart.uploads.block.size</name>
>   <value>67108864</value>
> </property>
> <property>
>   <name>fs.s3n.multipart.copy.block.size</name>
>   <value>5368709120</value>
> </property>
> Create a /etc/hadoop/conf/jets3t.properties file with the following contents (or similar):
> storage-service.internal-error-retry-max=5
> storage-service.disable-live-md5=false
> threaded-service.max-thread-count=20
> threaded-service.admin-max-thread-count=20
> s3service.max-thread-count=20
> s3service.admin-max-thread-count=20
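
A note on the values quoted above: 67108864 is 64 MB, and since S3 allows at most 10,000 parts
per multipart upload, a 64 MB part size caps a single upload at roughly 640 GB; 5368709120 is
5 GB, the largest part size S3 accepts. The same keys can also be set programmatically. A
minimal sketch, assuming a hypothetical bucket and path, with credentials configured elsewhere
(fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey):

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nMultipartConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same keys as the XML above, set in code instead of core-site.xml.
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64L * 1024 * 1024);   // 64 MB parts
        conf.setLong("fs.s3n.multipart.copy.block.size", 5L * 1024 * 1024 * 1024); // 5 GB copy parts

        // Hypothetical bucket and destination path.
        FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        FSDataOutputStream out = fs.create(new Path("s3n://my-bucket/big/object"));
        try {
            out.write(new byte[8 * 1024]); // stand-in payload; large writes trigger multipart
        } finally {
            out.close(); // the upload to S3 is finalized on close
        }
    }
}
{code}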



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
