Yes, Nan, totally agree. To be on the same page, that's exactly what I wrote wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu <> wrote:
besides that, one of the things which is needed by multiple frameworks is to schedule tasks in a single wave


if some frameworks like xgboost/mxnet requires 50 parallel workers, Spark is desired to provide a capability to ensure that either we run 50 tasks at once, or we should quit the complete application/job after some timeout period



On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <> wrote:
I think that's what Xiangrui was referring to. Instead of retrying a single task, retry the entire stage, and the entire stage of tasks need to be scheduled all at once.

On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <> wrote:

  • Fault tolerance and execution model: Spark assumes fine-grained task recovery, i.e. if something fails, only that task is rerun. This doesn’t match the execution model of distributed ML/DL frameworks that are typically MPI-based, and rerunning a single task would lead to the entire system hanging. A whole stage needs to be re-run.
This is not only useful for integrating with 3rd-party frameworks, but also useful for scaling MLlib algorithms. One of my earliest attempts in Spark MLlib was to implement All-Reduce primitive (SPARK-1485). But we ended up with some compromised solutions. With the new execution model, we can set up a hybrid cluster and do all-reduce properly.
Is there a particular new execution model you are referring to or do we plan to investigate a new execution model ?  For the MPI-like model, we also need gang scheduling (i.e. schedule all tasks at once or none of them) and I dont think we have support for that in the scheduler right now. 


Xiangrui Meng

Software Engineer

Databricks Inc.