hadoop-mapreduce-dev mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1946) enhance FileInputFormat.setInputPaths()
Date Fri, 16 Jul 2010 03:12:50 GMT
enhance FileInputFormat.setInputPaths()

                 Key: MAPREDUCE-1946
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1946
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
    Affects Versions: 0.20.2
            Reporter: Ted Yu

FileInputFormat.setInputPaths(Job job, Path... inputPaths) can be enhanced in the following
3 ways:
1) When the input paths are known only at runtime, we need another overload which accepts a
Collection<Path> as its second parameter, e.g. Set<Path> inputPaths.
2) Use StringBuilder instead of StringBuffer, because StringBuilder doesn't incur synchronization overhead.
3) The biggest performance boost comes from calling the following constructor of StringBuilder:
  public StringBuilder(int capacity)
capacity can be a 3rd parameter to setInputPaths(). This would avoid excessive calls to Arrays.copyOf().
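
The three points above can be sketched roughly as follows. This is a minimal
illustration, not Hadoop code: InputPathJoiner and join() are hypothetical names,
plain Strings stand in for Path so the example has no Hadoop dependency, and path
escaping is omitted. As a variation on point 3, the capacity is computed from the
paths instead of being passed in as a third parameter.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: joins input paths into the single
// comma-separated string that a job configuration property could hold.
final class InputPathJoiner {

    // Accepts any Collection (point 1), uses StringBuilder (point 2),
    // and pre-sizes it so the backing array never has to grow (point 3).
    static String join(Collection<String> inputPaths) {
        // Upper bound on the result length: every path plus one separator each.
        int capacity = 0;
        for (String p : inputPaths) {
            capacity += p.length() + 1;
        }

        StringBuilder sb = new StringBuilder(capacity);
        boolean first = true;
        for (String p : inputPaths) {
            if (!first) {
                sb.append(',');
            }
            sb.append(p);
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Paths known only at runtime, collected in a Set.
        Set<String> paths = new LinkedHashSet<>(
                Arrays.asList("/data/a", "/data/b", "/data/c"));
        System.out.println(join(paths));
    }
}
```

Computing the capacity costs one extra pass over the collection, which is cheap
compared with repeatedly growing the builder's internal array.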

The following stack trace was observed when our code called FileInputFormat.addInputPath() many
times because a large number of files were eligible for processing:
java.lang.Thread.State: RUNNABLE
	at java.util.Arrays.copyOf(Arrays.java:2882)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
	at java.lang.StringBuilder.append(StringBuilder.java:119)
	at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:330)
	at com.carrieriq.m2m.platform.mmp2.input.PackageInput.configureJobConf(PackageInput.java:336)

After incorporating all three optimizations, the total time taken in our customized
setInputPaths(JobConf conf, Set<Path> inputPaths) was 2 seconds. By comparison, the combined time
spent calling FileInputFormat.addInputPath() was over 80 minutes.
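
A rough cost model may help explain the size of that gap. The sketch below assumes
that each addInputPath() call rebuilds the entire comma-separated value (existing
string, separator, new path), which is consistent with the Arrays.copyOf() frames in
the stack trace but is a simplification. AppendCostModel and its methods are
hypothetical names, and the model counts only characters copied, ignoring all other
overhead.

```java
// Hypothetical cost model, not Hadoop code: counts characters copied when
// n paths of equal length are combined into one comma-separated string.
final class AppendCostModel {

    // One rebuild per added path: each step copies the whole string so far,
    // so the total grows quadratically with the number of paths.
    static long repeatedConcatCopies(int n, int pathLength) {
        long copied = 0;
        long current = 0; // length of the accumulated string
        for (int i = 0; i < n; i++) {
            long next = (i == 0) ? pathLength : current + 1 + pathLength;
            copied += next; // building the new string copies all of it
            current = next;
        }
        return copied;
    }

    // Single pass into a pre-sized builder: each character is copied once,
    // so the total grows linearly with the number of paths.
    static long singlePassCopies(int n, int pathLength) {
        return n == 0 ? 0 : (long) n * pathLength + (n - 1);
    }

    public static void main(String[] args) {
        int n = 100_000;      // assumed number of input files
        int pathLength = 80;  // assumed average path length
        System.out.println("repeated concat: " + repeatedConcatCopies(n, pathLength));
        System.out.println("single pass:     " + singlePassCopies(n, pathLength));
    }
}
```

Under this model the repeated-append total scales with the square of the number of
paths, while the single-pass total scales linearly.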

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
