hadoop-common-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13371) S3A globber to use bulk listObject call over recursive directory scan
Date Sat, 18 Mar 2017 00:51:42 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930952#comment-15930952 ]

ASF GitHub Bot commented on HADOOP-13371:
-----------------------------------------

GitHub user kazuyukitanimura opened a pull request:

    https://github.com/apache/hadoop/pull/203

    HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan

    Hi @steveloughran 
    
    This pull request mitigates the issue described in [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371).
    
    With this patch, the filter is now applied before the glob happens.
    
    Previously I ran into OOM errors when globbing large S3 buckets, because all possible
    paths were kept and the filtering happened only at the end. This patch prunes unnecessary
    paths with the filter first. I have applied it to our production pipelines, and they run flawlessly.
    This should be applicable to branch-2.8 as well.
    
    Thanks in advance for reviewing this.
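
As a rough illustration of the idea described in the pull request text above (not the code in the patch itself), the sketch below contrasts filtering after all candidate paths have been collected with filtering as each candidate is produced. The in-memory `TREE` and the names `walk`, `filter_late` and `filter_early` are hypothetical stand-ins, not Hadoop APIs.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical in-memory "bucket": directory path -> names of its children.
TREE: Dict[str, List[str]] = {
    "/": ["logs", "tmp"],
    "/logs": ["2017-03-17.gz", "2017-03-18.gz"],
    "/tmp": ["a.part", "b.part"],
}


def walk(path: str = "/") -> Iterator[str]:
    """Yield every file path under `path`, depth first."""
    for name in TREE.get(path, []):
        child = path.rstrip("/") + "/" + name
        if child in TREE:
            # `child` is a directory in this toy model: descend into it.
            yield from walk(child)
        else:
            yield child


def filter_late(accept: Callable[[str], bool]) -> List[str]:
    """Collect every candidate first, then filter: memory grows with the tree."""
    candidates = list(walk())
    return [p for p in candidates if accept(p)]


def filter_early(accept: Callable[[str], bool]) -> Iterator[str]:
    """Apply the filter as candidates are produced: rejected paths are never stored."""
    for p in walk():
        if accept(p):
            yield p
```

Both functions return the same matches; the difference is only in how many rejected paths are held in memory at once, which is what matters for very large buckets.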

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bloomreach/hadoop trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #203
    
----
commit 5d6b3e1ebb97cc11479db6c30b0a1a04986c4967
Author: kazu <kazu@bloomreach.com>
Date:   2017-03-18T00:24:41Z

    HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan

----


> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in {{FileSystem.listStatus}} calls,
> but doesn't do anything for {{FileSystem.globStatus()}}, which uses a completely different
> codepath, one which does a selective recursive scan by pattern matching as it goes down, filtering
> out those patterns which don't match. Cost is O(matching-directories) + cost of examining
> the files.
> It should be possible to do the glob status listing in S3A not through the filtered treewalk,
> but through a list + filter operation. This would be an O(files) lookup *before any filtering
> took place*.
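
The "list + filter" approach in the description above can be sketched as follows. This is a minimal illustration under stated assumptions, not S3A code: `keys` stands in for the result of a flat bulk listing, `glob_to_regex` and `glob_by_listing` are hypothetical helper names, and only `*` and `?` are supported (no `{a,b}` alternation or character classes, and no matching of directories).

```python
import re
from typing import Iterable, List


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex; '*' and '?' never match '/'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def glob_by_listing(keys: Iterable[str], pattern: str) -> List[str]:
    """Filter a flat listing of object keys against the glob in a single pass."""
    regex = glob_to_regex(pattern)
    return [k for k in keys if regex.fullmatch(k)]
```

For example, with the keys `data/2017/a.csv`, `data/2017/b.txt`, `data/2016/c.csv` and `data/2017/sub/d.csv`, the pattern `data/*/*.csv` selects only the first and third. Every key is examined once, which is the O(files) cost the description mentions.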



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

