hadoop-mapreduce-dev mailing list archives

From "Dennis Huo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6758) TestDFSIO should parallelize its creation of control files on setup
Date Wed, 17 Aug 2016 01:57:20 GMT
Dennis Huo created MAPREDUCE-6758:

             Summary: TestDFSIO should parallelize its creation of control files on setup
                 Key: MAPREDUCE-6758
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6758
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: test
            Reporter: Dennis Huo

TestDFSIO currently uses a sequential for-loop to create {{nrFiles}} control files in
{{controlDir}}, a subdirectory of the overall {{test.build.data}} directory, which
may reside on a non-HDFS FileSystem implementation:

private void createControlFile(FileSystem fs,
                                long nrBytes, // in bytes
                                int nrFiles
                              ) throws IOException {
  LOG.info("creating control file: "+nrBytes+" bytes, "+nrFiles+" files");

  Path controlDir = getControlDir(config);
  fs.delete(controlDir, true);

  for(int i=0; i < nrFiles; i++) {
    String name = getFileName(i);
    Path controlFile = new Path(controlDir, "in_file_" + name);
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, config, controlFile,
                                         Text.class, LongWritable.class,
                                         CompressionType.NONE);
      writer.append(new Text(name), new LongWritable(nrBytes));
    } catch(Exception e) {
      throw new IOException(e.getLocalizedMessage());
    } finally {
      if (writer != null)
        writer.close();
      writer = null;
    }
  }
  LOG.info("created control files for: "+nrFiles+" files");
}

When testing against an object-store-based filesystem with higher round-trip latency than HDFS
(such as S3 or GCS), job setup that might take only seconds on HDFS can take minutes or even
tens of minutes if the test uses thousands of control files, because each file is created only
after the previous one has finished. In the same vein as other JIRAs under
[https://issues.apache.org/jira/browse/HADOOP-11694], the control-file creation should be
parallelized/multithreaded so that large TestDFSIO jobs launch efficiently against FileSystem
implementations that have high round-trip latency but can still sustain high overall
throughput/QPS.
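
One possible shape for the parallel version is sketched below. It is only an illustration, not the patch for this issue: the class name {{ParallelControlFiles}}, the {{ControlFileCreator}} callback and the {{numThreads}} parameter are all hypothetical, and the callback stands in for the body of the existing for-loop (the {{SequenceFile.createWriter}} / {{append}} / {{close}} sequence) so that the sketch compiles without Hadoop on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of TestDFSIO.
class ParallelControlFiles {

  /** Hypothetical callback: writes the control file with the given index. */
  public interface ControlFileCreator {
    void create(int index) throws IOException;
  }

  /**
   * Runs creator.create(i) for every i in [0, nrFiles) on a fixed-size
   * thread pool and waits for all of them. The first failure observed
   * is rethrown as an IOException.
   */
  public static void createAll(int nrFiles, int numThreads,
                               ControlFileCreator creator) throws IOException {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (int i = 0; i < nrFiles; i++) {
        final int index = i;
        // Submitted as a Callable so the checked IOException can propagate.
        pending.add(pool.submit(() -> {
          creator.create(index);
          return null;
        }));
      }
      // Wait for every task; get() surfaces any exception a task threw.
      for (Future<?> f : pending) {
        try {
          f.get();
        } catch (ExecutionException e) {
          throw new IOException("control file creation failed", e.getCause());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("interrupted while creating control files", e);
        }
      }
    } finally {
      // Stops the worker threads; also cancels remaining tasks after a failure.
      pool.shutdownNow();
    }
  }
}
```

In TestDFSIO the callback would wrap the per-file work from {{createControlFile}}, and the pool size would presumably need to be configurable, since the useful degree of parallelism depends on the latency and request limits of the target FileSystem.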

This message was sent by Atlassian JIRA

