hadoop-mapreduce-dev mailing list archives

From "Xi Fang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-5278) Perf: Distributed cache is broken when JT staging dir is not on the default FS
Date Tue, 28 May 2013 06:16:22 GMT
Xi Fang created MAPREDUCE-5278:
----------------------------------

             Summary: Perf: Distributed cache is broken when JT staging dir is not on the default FS
                 Key: MAPREDUCE-5278
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: distributed-cache
    Affects Versions: 1-win
         Environment: Windows
            Reporter: Xi Fang


Today, we set the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") to point
to HDFS even though ASV is the default file system. There are a few reasons why this
configuration was chosen:
- It prevents the storage account credentials from leaking to the user's storage account
  (in other words, it keeps job.xml in the cluster). This is needed until HADOOP-444 is fixed.
- It uses HDFS for the transient job files, which is good for two reasons: a) it does not
  flood the user's storage account with irrelevant data/files, and b) it leverages HDFS
  locality for small files.
However, this approach conflicts with how the distributed cache caches files, completely
negating the feature's functionality.
When files are added to the distributed cache (through the -files/-archives/-libjars Hadoop
generic options), they are copied to the JobTracker staging dir only if they reside on a file
system different from the JobTracker's. Later on, this path is used as a "key" to cache the
files locally on the TaskTracker's machine and to avoid localization (download/unzip) of the
distributed cache files if they are already localized.
In our configuration the caching is effectively disabled: we always end up copying dist
cache files to the JT staging dir first and then localizing them on the TaskTracker machine.
This is especially harmful for Oozie scenarios, as Oozie uses the dist cache to distribute
Hive/Pig jars throughout the cluster.
An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in mapred-site.xml
to be on the default FS.
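
The copy-then-key behaviour described above can be modelled with a small, self-contained
Java sketch. This is a toy model, not Hadoop code: the class and method names, the example
URIs, and the per-job "<staging>/<jobId>/files/" layout are all assumptions made for
illustration. It only shows why a staging dir on a different file system could turn every job
into a cache miss, while a staging dir on the default FS lets the original path be reused as
the key.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Toy model of the behaviour described in this issue; not Hadoop code. */
public class DistCacheKeySketch {

    /** True when both URIs name the same file system (same scheme and authority). */
    static boolean sameFileSystem(URI a, URI b) {
        return Objects.equals(a.getScheme(), b.getScheme())
            && Objects.equals(a.getAuthority(), b.getAuthority());
    }

    /**
     * Path that ends up as the cache key. If the file already lives on the
     * staging file system it is used as-is; otherwise it is "copied" to a
     * per-job location under the staging root (layout is an assumption).
     */
    static URI cacheKey(URI file, URI stagingRoot, String jobId) {
        if (sameFileSystem(file, stagingRoot)) {
            return file;
        }
        String path = file.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return URI.create(stagingRoot + "/" + jobId + "/files/" + name);
    }

    /** Counts how often one shared file is localized across several jobs. */
    static int localizations(URI file, URI stagingRoot, String... jobIds) {
        Set<URI> localized = new HashSet<>();
        int downloads = 0;
        for (String jobId : jobIds) {
            // A key not seen before means a download/unzip on the task node.
            if (localized.add(cacheKey(file, stagingRoot, jobId))) {
                downloads++;
            }
        }
        return downloads;
    }

    public static void main(String[] args) {
        // Hypothetical URIs; real ASV/HDFS addresses will differ.
        URI jar = URI.create("asv://container@account/lib/hive-exec.jar");
        URI hdfsStaging = URI.create("hdfs://namenode:8020/mapred/staging");
        URI asvStaging = URI.create("asv://container@account/mapred/staging");

        System.out.println("staging on HDFS: "
            + localizations(jar, hdfsStaging, "job_1", "job_2", "job_3"));
        System.out.println("staging on default FS: "
            + localizations(jar, asvStaging, "job_1", "job_2", "job_3"));
    }
}
```

In this model, three jobs sharing one jar trigger three localizations when the staging dir
is on HDFS and one when it is on the default FS, which matches the workaround suggested
above.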

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
