hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15492) increase performance of s3guard import command
Date Thu, 24 May 2018 14:10:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489073#comment-16489073 ]

Steve Loughran commented on HADOOP-15492:
-----------------------------------------

FWIW I'm thinking this could be used for a fast update of a directory tree as maintenance,
but I don't think it's efficient enough yet.

> increase performance of s3guard import command
> ----------------------------------------------
>
>                 Key: HADOOP-15492
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15492
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Steve Loughran
>            Priority: Major
>
> Some perf improvements which spring to mind having looked at the s3guard import command
> Key points: it can handle the import of a tree with existing data better
> # if the bucket is already under s3guard, then the listing will return all listed files,
> which will then be put() again.
> # import calls {{putParentsIfNotPresent()}}, but DDBMetaStore.put() will do the parent
> creation anyway
> # For each entry in the store (i.e. a file), the full parent listing is created, then
> a batch write created to put all the parents and the actual file
> As a result, it's at risk of doing many more put calls than needed, especially for wide/deep
> directory trees.
> It would be much more efficient to put all files in a single directory as part of 1+
> batch request, with 1 parent tree. Better yet: a get() of that parent could skip the put of
> parent entries.
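
A rough sketch of the batching idea in the description above, not the actual S3Guard
code: group the listed files by parent directory, build the parent chain once per
directory, and stop walking upwards as soon as a directory is already known. The class
and method names here (BatchedImportSketch, planBatches, knownDirs) are hypothetical,
paths are plain absolute strings rather than Hadoop Path objects, and the "already known"
set stands in for a get() against the metadata store. It also assumes that if a directory
is known, all of its ancestors are known too.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BatchedImportSketch {

    /** Returns the parent of an absolute slash-separated path, or null for the root. */
    static String parent(String path) {
        int idx = path.lastIndexOf('/');
        if (idx <= 0) {
            return path.equals("/") ? null : "/";
        }
        return path.substring(0, idx);
    }

    /** Groups file paths by their parent directory, preserving first-seen order. */
    static Map<String, List<String>> groupByParent(List<String> files) {
        Map<String, List<String>> byDir = new LinkedHashMap<>();
        for (String f : files) {
            byDir.computeIfAbsent(parent(f), k -> new ArrayList<>()).add(f);
        }
        return byDir;
    }

    /**
     * Plans one batch per directory: any missing ancestors (top-most first),
     * followed by every file in that directory. knownDirs is updated in place
     * and plays the role of the store lookup (an assumption of this sketch).
     */
    static List<List<String>> planBatches(List<String> files, Set<String> knownDirs) {
        List<List<String>> batches = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groupByParent(files).entrySet()) {
            // Walk up from the directory until a known ancestor (or the root) is hit.
            Deque<String> missing = new ArrayDeque<>();
            for (String d = e.getKey(); d != null && !knownDirs.contains(d); d = parent(d)) {
                missing.push(d);
            }
            List<String> batch = new ArrayList<>();
            for (String d : missing) {
                knownDirs.add(d);
                batch.add(d);
            }
            batch.addAll(e.getValue());
            batches.add(batch);
        }
        return batches;
    }
}
```

For three files under /a/b and /a/c, this plans two batches with eight entries in total,
where putting every file together with its full parent chain would write twelve. A real
implementation would still need to split each batch to fit the store's batch size limit.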



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


