hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dennis Huo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6759) JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
Date Wed, 17 Aug 2016 17:39:20 GMT
Dennis Huo created MAPREDUCE-6759:
-------------------------------------

             Summary: JobSubmitter/JobResourceUploader should parallelize upload of -libjars,
-files, -archives
                 Key: MAPREDUCE-6759
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6759
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission
            Reporter: Dennis Huo


During job submission, the {{JobResourceUploader}} currently iterates over for-loops of {{-libjars}},
{{-files}}, and {{-archives}} sequentially, which can significantly slow down job startup
time when a large number of files need to be uploaded, especially if staging the files to
a cloud object-store based FileSystem implementation like S3, GCS, WABS, etc., where round-trip
latencies may be higher than HDFS despite having good throughput when parallelized:

{code:title=JobResourceUploader.java}
    if (files != null) {
      FileSystem.mkdirs(jtFs, filesDir, mapredSysPerms);
      String[] fileArr = files.split(",");
      for (String tmpFile : fileArr) {
        URI tmpURI = null;
        try {
          tmpURI = new URI(tmpFile);
        } catch (URISyntaxException e) {
          throw new IllegalArgumentException(e);
        }
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(filesDir, tmp, conf, replication);
        try {
          URI pathURI = getPathURI(newPath, tmpURI.getFragment());
          DistributedCache.addCacheFile(pathURI, conf);
        } catch (URISyntaxException ue) {
          // should not throw a uri exception
          throw new IOException("Failed to create uri for " + tmpFile, ue);
        }
      }
    }

    if (libjars != null) {
      FileSystem.mkdirs(jtFs, libjarsDir, mapredSysPerms);
      String[] libjarsArr = libjars.split(",");
      for (String tmpjars : libjarsArr) {
        Path tmp = new Path(tmpjars);
        Path newPath = copyRemoteFiles(libjarsDir, tmp, conf, replication);
        DistributedCache.addFileToClassPath(
            new Path(newPath.toUri().getPath()), conf, jtFs);
      }
    }

    if (archives != null) {
      FileSystem.mkdirs(jtFs, archivesDir, mapredSysPerms);
      String[] archivesArr = archives.split(",");
      for (String tmpArchives : archivesArr) {
        URI tmpURI;
        try {
          tmpURI = new URI(tmpArchives);
        } catch (URISyntaxException e) {
          throw new IllegalArgumentException(e);
        }
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(archivesDir, tmp, conf, replication);
        try {
          URI pathURI = getPathURI(newPath, tmpURI.getFragment());
          DistributedCache.addCacheArchive(pathURI, conf);
        } catch (URISyntaxException ue) {
          // should not throw an uri excpetion
          throw new IOException("Failed to create uri for " + tmpArchives, ue);
        }
      }
    }
{code}

Parallelizing the upload of these files would improve job submission time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-help@hadoop.apache.org


Mime
View raw message