hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11785) Reduce number of listStatus operation in distcp buildListing()
Date Wed, 01 Apr 2015 20:51:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391432#comment-14391432 ]

Colin Patrick McCabe commented on HADOOP-11785:
-----------------------------------------------

Thanks, [~3opan].  This looks good in general.

bq. Should I mark this as a bug fix instead of improvement?

I don't see this as a bug because the functionality is correct.  It seems to be an improvement.

{code}
-   * Collect the list of 
+   * Collect the list of
-   *     the the source root is a directory, then the source root entry is not 
+   *     the the source root is a directory, then the source root entry is not
-    if (fileStatus.getPath().equals(sourcePathRoot) && 
+    if (fileStatus.getPath().equals(sourcePathRoot) &&
{code}

Can you remove these whitespace changes from the patch?  They are distracting and make it
look like things have changed when in fact they have not.  I think there are a few other
whitespace changes as well.

{{traverseDirectory}}: Maybe we can optimize this even more.  Can we pass in the sourceFS
to this function, rather than calling {{Path#getFileSystem}}?  {{Path#getFileSystem}} requires
some synchronization, which might add overhead.
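
To illustrate the idea, here is a rough, self-contained sketch.  It is not the patch and does
not use the Hadoop API: {{FakeFs}}, {{FsResolver}} and both traverse methods are hypothetical
stand-ins, with {{FsResolver#getFileSystem}} playing the role of a synchronized
{{Path#getFileSystem}} lookup.  The only point is that the caller can resolve the filesystem
once and hand it down, instead of resolving it again for every directory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in model only; none of these types are part of Hadoop. */
final class TraverseSketch {

    /** Hypothetical filesystem: maps each directory path to its children. */
    static final class FakeFs {
        private final Map<String, List<String>> tree;

        FakeFs(Map<String, List<String>> tree) {
            this.tree = tree;
        }

        List<String> listStatus(String dir) {
            return tree.getOrDefault(dir, List.of());
        }

        boolean isDirectory(String path) {
            return tree.containsKey(path);
        }
    }

    /** Hypothetical synchronized lookup that counts how often it is called. */
    static final class FsResolver {
        private final FakeFs fs;
        final AtomicInteger lookups = new AtomicInteger();

        FsResolver(FakeFs fs) {
            this.fs = fs;
        }

        synchronized FakeFs getFileSystem(String path) {
            lookups.incrementAndGet();
            return fs;
        }
    }

    /** Resolves the filesystem again for every directory it visits. */
    static List<String> traversePerPathLookup(FsResolver resolver, String root) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            FakeFs fs = resolver.getFileSystem(dir); // one synchronized lookup per directory
            for (String child : fs.listStatus(dir)) {
                out.add(child);
                if (fs.isDirectory(child)) {
                    pending.push(child);
                }
            }
        }
        return out;
    }

    /** Takes the already-resolved filesystem as a parameter; no lookups inside the loop. */
    static List<String> traverseWithFs(FakeFs sourceFS, String root) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            for (String child : sourceFS.listStatus(dir)) {
                out.add(child);
                if (sourceFS.isDirectory(child)) {
                    pending.push(child);
                }
            }
        }
        return out;
    }
}
```

In this model the first variant does one lookup per directory visited, while the second needs
only the single lookup the caller makes before starting the traversal.  Both return the same
listing.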

It looks good aside from that.  Thanks.

> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
>                 Key: HADOOP-11785
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11785
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: distcp-liststatus.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source trees (I was
> using a source of 1.5M files in a tree of about 50K directories). For input on S3, buildListing
> was taking more than one hour. I've noticed a performance bug in the current code, which calls
> listStatus twice for each directory; this doubles the number of RPCs in some cases (if most
> directories do not contain >1000 files).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)