hadoop-common-issues mailing list archives

From "Zoran Dimitrijevic (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
Date Wed, 01 Apr 2015 17:01:01 GMT
Zoran Dimitrijevic created HADOOP-11785:

             Summary: Reduce number of listStatus operation in distcp buildListing()
                 Key: HADOOP-11785
                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
             Project: Hadoop Common
          Issue Type: Improvement
          Components: tools/distcp
    Affects Versions: 3.0.0
            Reporter: Zoran Dimitrijevic
            Assignee: Zoran Dimitrijevic
            Priority: Minor
             Fix For: 3.0.0
         Attachments: distcp-liststatus.patch

DistCp was taking a long time in copyListing.buildListing() for large source trees (my source
had 1.5M files in a tree of about 50K directories). With the source on S3, buildListing() took
more than an hour. I noticed a performance bug in the current code: it calls listStatus twice
for each directory, which doubles the number of RPCs in some cases (when most directories
contain no more than 1000 files).
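
To make the cost concrete, here is a minimal, self-contained sketch of the pattern described above. It is a simplified model and not the DistCp code or the attached patch: the in-memory tree, the call counter, and the names ListingSketch, buildListingTwice and buildListingOnce are all hypothetical, and the assumption that the extra call comes from an emptiness check before descending is mine. Only the idea of listing each directory once instead of twice comes from the report.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ListingSketch {
    // Hypothetical in-memory tree: directory path -> child paths.
    // Any path that is not a key in this map is treated as a file.
    static final Map<String, List<String>> TREE = new HashMap<>();
    static {
        TREE.put("/src", List.of("/src/a", "/src/b", "/src/f0"));
        TREE.put("/src/a", List.of("/src/a/f1", "/src/a/f2"));
        TREE.put("/src/b", List.of("/src/b/c"));
        TREE.put("/src/b/c", List.of("/src/b/c/f3"));
    }

    // Stand-in for the number of listStatus RPCs sent to the file system.
    static int listCalls = 0;

    // Stand-in for a remote listStatus call; every invocation counts as one RPC.
    static List<String> listStatus(String dir) {
        listCalls++;
        return TREE.get(dir);
    }

    static boolean isDir(String path) {
        return TREE.containsKey(path);
    }

    // Assumed shape of the slow path: each subdirectory is listed once to
    // check that it is non-empty, then listed again when it is traversed.
    static int buildListingTwice(String root) {
        int entries = 0;
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.pop())) {
                entries++;
                if (isDir(child) && !listStatus(child).isEmpty()) {
                    pending.push(child); // listed again when popped
                }
            }
        }
        return entries;
    }

    // Single listing: each directory is listed exactly once, when it is popped.
    static int buildListingOnce(String root) {
        int entries = 0;
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.pop())) {
                entries++;
                if (isDir(child)) {
                    pending.push(child);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        listCalls = 0;
        int twice = buildListingTwice("/src");
        System.out.println("double listing: entries=" + twice + " calls=" + listCalls);

        listCalls = 0;
        int once = buildListingOnce("/src");
        System.out.println("single listing: entries=" + once + " calls=" + listCalls);
    }
}
```

On this toy tree of four directories both traversals find the same 7 entries, but the first issues 7 listStatus calls and the second 4. As the tree grows, the ratio approaches 2x, which matches the doubling described in the report.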

