spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks
Date Thu, 09 Feb 2017 07:57:42 GMT


Apache Spark reassigned SPARK-16929:

    Assignee:     (was: Apache Spark)

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> ------------------------------------------------------------------------
>                 Key: SPARK-16929
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Nicholas Brown
> Our cluster has been running slowly since I enabled speculation. Looking into it, I noticed
that stderr reported some tasks taking almost an hour to run, even though the application logs
on the nodes showed those same tasks finishing in a minute or so. Digging into the thread dump
for the master node, I noticed a number of threads are blocked, apparently by the speculation
thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks
through the tasks to see what needs to be rerun. Unfortunately that code loops through each
of the tasks, so when you have even just a couple hundred thousand tasks to run, that scan can
be prohibitively slow to perform inside a synchronized block. Once I disabled speculation,
the job went back to acceptable performance.
> There are no comments around that lock indicating why it was added, and the git history
seems to include a couple of refactorings, so it's hard to find where it was introduced. I'm
tempted to believe it is the result of someone assuming that an extra synchronized block never
hurt anyone (in reality, I've seen roughly as many bugs caused by over-synchronization as by
too little), as it looks too broad to actually be guarding any potential concurrency issue.
But since concurrency issues can be tricky to reproduce (and yes, I understand that's an
extreme understatement), I'm not sure blindly removing it without being familiar with the
history is necessarily safe.
> Can someone look into this?  Or at least make a note in the documentation that speculation
should not be used with large clusters?
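The report describes a classic coarse-grained locking bottleneck: an O(n) scan over every task performed while holding the scheduler-wide lock, so every other scheduler operation blocks for the duration of the scan. The sketch below is not Spark's actual code; it is a minimal, hypothetical Java illustration of the pattern described above, contrasting the coarse scan-under-lock with the common mitigation of snapshotting the collection under the lock and doing the expensive scan outside it. All class and method names here are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the bottleneck described in the report, not Spark code.
public class SpeculationSketch {
    private final Object schedulerLock = new Object();
    private final List<Long> taskRuntimesMs = new ArrayList<>();

    public void recordTask(long runtimeMs) {
        synchronized (schedulerLock) {
            taskRuntimesMs.add(runtimeMs);
        }
    }

    // Coarse-grained: the whole O(n) scan runs inside the lock, so every
    // other thread touching the scheduler blocks until the loop finishes.
    public int countSpeculatableCoarse(long thresholdMs) {
        synchronized (schedulerLock) {
            int n = 0;
            for (long t : taskRuntimesMs) {
                if (t > thresholdMs) n++;
            }
            return n;
        }
    }

    // Finer-grained alternative: hold the lock only long enough to copy
    // the list, then do the expensive scan without blocking other threads.
    public int countSpeculatableSnapshot(long thresholdMs) {
        List<Long> snapshot;
        synchronized (schedulerLock) {
            snapshot = new ArrayList<>(taskRuntimesMs);
        }
        int n = 0;
        for (long t : snapshot) {
            if (t > thresholdMs) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        SpeculationSketch s = new SpeculationSketch();
        s.recordTask(50);
        s.recordTask(500);
        s.recordTask(5000);
        // Both variants agree on the result; only lock-hold time differs.
        System.out.println(s.countSpeculatableCoarse(100));
        System.out.println(s.countSpeculatableSnapshot(100));
    }
}
```

The snapshot variant trades extra memory (the copy) for a much shorter critical section; whether that trade is safe in the real code depends on exactly what invariants the original lock is guarding, which is the open question the report raises.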

This message was sent by Atlassian JIRA

