hadoop-mapreduce-dev mailing list archives

From "Victor Zhang (Jira)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-7233) MapReduce Input Path Should Ignore Path Ends With '/*' When Job Submit
Date Tue, 20 Aug 2019 09:10:00 GMT
Victor Zhang created MAPREDUCE-7233:

             Summary: MapReduce Input Path Should Ignore Path Ends With '/*' When Job Submit
                 Key: MAPREDUCE-7233
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7233
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission, performance
    Affects Versions: 2.7.2
            Reporter: Victor Zhang
         Attachments: job submit.jpg

We have a public, shared Hadoop cluster that runs many MR jobs from different departments.


I found that job submission is very slow when the job's input path ends with "/*",
e.g. "/my/path/*", whereas "/my/path" or "/my/path/" works fine.


After reading the code, I think the problem lies in the split calculation.


The FileInputFormat#singleThreadedListStatus() method first gets an array of FileStatus
objects. If the input path ends with "/*", the result contains one FileStatus object for
every file/directory under the input path. If the input path does not end with "/*", the
result contains only one FileStatus object (the input path itself).


The next step is to find the LocatedFileStatus for each FileStatus object. However, only
the directory FileStatus objects are searched for LocatedFileStatus objects
(dfs.listPaths(), in batches); the file FileStatus objects are kept as they are.


Finally, when the job splits are calculated, e.g. in FileInputFormat#getSplits(), any
FileStatus that is not a LocatedFileStatus object has its block locations fetched with
fs.getFileBlockLocations(). This can lead to a lot of RPC requests when there are many
files in the input path. CombineFileInputFormat does the same in the constructor of
OneFileInfo.
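
The three steps above can be illustrated with a small toy model. This is only a sketch
in plain Java, not Hadoop code: the class SplitRpcModel, its Status type and the
"part-N" file names are hypothetical, the directory is assumed to contain only files,
and only the per-file block-location lookups are counted (the batched directory listing
itself is not).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the listing logic described above (hypothetical, not Hadoop code). */
class SplitRpcModel {

    /** Minimal stand-in for FileStatus / LocatedFileStatus. */
    static final class Status {
        final String path;
        final boolean isDirectory;
        final boolean hasBlockLocations; // true models a LocatedFileStatus

        Status(String path, boolean isDirectory, boolean hasBlockLocations) {
            this.path = path;
            this.isDirectory = isDirectory;
            this.hasBlockLocations = hasBlockLocations;
        }
    }

    /** Glob step: "/dir/*" matches every child, anything else matches the directory itself. */
    static List<Status> glob(String inputPath, int fileCount) {
        List<Status> matches = new ArrayList<>();
        if (inputPath.endsWith("/*")) {
            String dir = inputPath.substring(0, inputPath.length() - 2);
            for (int i = 0; i < fileCount; i++) {
                matches.add(new Status(dir + "/part-" + i, false, false));
            }
        } else {
            matches.add(new Status(inputPath, true, false));
        }
        return matches;
    }

    /** Counts the per-file block-location lookups that split calculation would need. */
    static int countBlockLocationRpcs(String inputPath, int fileCount) {
        List<Status> result = new ArrayList<>();
        for (Status match : glob(inputPath, fileCount)) {
            if (match.isDirectory) {
                // Directory match: the batched listing returns children that
                // already carry their block locations.
                for (int i = 0; i < fileCount; i++) {
                    result.add(new Status(match.path + "/part-" + i, false, true));
                }
            } else {
                // File match: kept as a plain status without block locations.
                result.add(match);
            }
        }
        int rpcs = 0;
        for (Status status : result) {
            if (!status.hasBlockLocations) {
                rpcs++; // models one fs.getFileBlockLocations() call
            }
        }
        return rpcs;
    }

    public static void main(String[] args) {
        System.out.println(countBlockLocationRpcs("/my/path", 10000));   // prints 0
        System.out.println(countBlockLocationRpcs("/my/path/*", 10000)); // prints 10000
    }
}
```

In this model the number of extra lookups grows linearly with the number of files when
the path ends with "/*", and stays at zero otherwise.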


So, in this case, some jobs take a few minutes or even hours to submit.


I tried removing the "/*" suffix from the input path before the code that gets the file
status, but I have not confirmed whether this causes other problems.
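
A minimal sketch of that suffix removal, in plain Java on the path string. The class
InputPathNormalizer and the method stripTrailingGlob are hypothetical names, not part
of Hadoop, and this is not the actual change I tested in FileInputFormat.

```java
/** Hypothetical helper, not part of Hadoop: drops a trailing "/*" from an input path. */
class InputPathNormalizer {

    private static final String GLOB_SUFFIX = "/*";

    static String stripTrailingGlob(String inputPath) {
        // Leave a bare "/*" untouched: stripping it would yield an empty path.
        if (inputPath.endsWith(GLOB_SUFFIX) && inputPath.length() > GLOB_SUFFIX.length()) {
            return inputPath.substring(0, inputPath.length() - GLOB_SUFFIX.length());
        }
        return inputPath;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingGlob("/my/path/*")); // prints /my/path
        System.out.println(stripTrailingGlob("/my/path/"));  // prints /my/path/
    }
}
```

One possible problem: if the directory contains subdirectories, "/my/path/*" also
matches them and their files get listed, so "/my/path" may not select the same set of
files. I have not verified this case.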
