hadoop-mapreduce-dev mailing list archives

From "Vladimir Klimontovich (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-2235) JobTracker "over-synchronization" makes it hang up in certain cases
Date Mon, 27 Dec 2010 15:25:45 GMT
JobTracker "over-synchronization" makes it hang up in certain cases 

                 Key: MAPREDUCE-2235
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2235
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker
    Affects Versions: 0.21.0, 0.20.2, 0.20.1
            Reporter: Vladimir Klimontovich

There is a general problem in the JobTracker.java code: it synchronizes on "this" everywhere,
so only one method can execute at a time. When the job submission rate is low (less
than one job every several seconds), the tracker works without a problem. When the job
submission rate is high, the following problem occurs:

Inside submitJob(), the JT copies the job jar and xml to the local filesystem and then runs
"chmod" on those files. Hadoop performs chmod by spawning a child process. When the JT heap
is large (several gigabytes), spawning a child process takes a long time because Java calls
fork(); in our case it takes about 1-2 seconds. As a result, the JobTracker can't handle
high-frequency job submissions.

In addition, because the heartbeat() method is also synchronized, the JT stops processing
heartbeats while the "this" monitor is held by submitJob(). That makes the JT think that a
lot of TaskTrackers are down.

The following solution could help:

"chmod" is called from the submitJob() method, under the following line:

JobInProgress job = new JobInProgress(jobId, this, this.conf);

This block could be moved out of the synchronized code:

public JobStatus submitJob(JobID jobId) throws IOException {
    synchronized (this) {
        .... the rest
    }

    // This line is left outside the synchronized code because it does not
    // depend on the state of the JobTracker. Also this line

    JobInProgress job = new JobInProgress(jobId, this, this.conf);

    synchronized (this) {
        .... the rest
    }
}
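
The effect of narrowing the lock can be reproduced with a small standalone sketch. This is
not JobTracker code: the class NarrowLockSketch, its methods, and the 300 ms sleep standing
in for the slow fork()/chmod step are all hypothetical. It only measures how long a
synchronized heartbeat() call waits while a submit is in its slow step, first with the
monitor held for the whole submit and then with the slow step moved outside the lock.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a tracker object; not the real JobTracker.
public class NarrowLockSketch {
    private int submitted = 0;   // shared state, guarded by "this"
    private int heartbeats = 0;  // shared state, guarded by "this"

    // Assumed stand-in for the slow, state-independent step (the fork() behind chmod).
    private static void slowSetup(long millis, CountDownLatch started) {
        started.countDown();     // signal that the slow step has begun
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Coarse locking: the monitor is held during the slow step.
    public synchronized void submitCoarse(long millis, CountDownLatch started) {
        slowSetup(millis, started);
        submitted++;
    }

    // Narrow locking: the slow step runs without the monitor; only the
    // update of shared state is synchronized.
    public void submitNarrow(long millis, CountDownLatch started) {
        slowSetup(millis, started);
        synchronized (this) {
            submitted++;
        }
    }

    public synchronized int heartbeat() {
        return ++heartbeats;
    }

    // Returns how long one heartbeat() call took (ms) while a submit was in its slow step.
    public static long heartbeatDelayMillis(boolean coarse, long slowMillis)
            throws InterruptedException {
        NarrowLockSketch tracker = new NarrowLockSketch();
        CountDownLatch started = new CountDownLatch(1);
        Thread submitter = new Thread(() -> {
            if (coarse) {
                tracker.submitCoarse(slowMillis, started);
            } else {
                tracker.submitNarrow(slowMillis, started);
            }
        });
        submitter.start();
        started.await();                 // wait until the slow step is running
        long begin = System.nanoTime();
        tracker.heartbeat();             // blocks if the submitter holds the monitor
        long elapsedMillis = (System.nanoTime() - begin) / 1_000_000;
        submitter.join();
        return elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("coarse lock, heartbeat waited (ms): " + heartbeatDelayMillis(true, 300));
        System.out.println("narrow lock, heartbeat waited (ms): " + heartbeatDelayMillis(false, 300));
    }
}
```

With the coarse variant the heartbeat should wait for roughly the whole slow step; with the
narrow variant it should return almost immediately. Exact numbers depend on the machine.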

