spark-dev mailing list archives

From Shivaram Venkataraman <>
Subject Re: Integrating ML/DL frameworks with Spark
Date Tue, 08 May 2018 15:53:15 GMT
>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>    task recovery, i.e. if something fails, only that task is rerun. This
>>    doesn’t match the execution model of distributed ML/DL frameworks that are
>>    typically MPI-based, and rerunning a single task would lead to the entire
>>    system hanging. A whole stage needs to be re-run.
> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
> <>). But we ended up with
> some compromised solutions. With the new execution model, we can set up a
> hybrid cluster and do all-reduce properly.
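The all-reduce primitive referenced above (SPARK-1485) can be illustrated in isolation. The following is a minimal sketch of a synchronous ring all-reduce, simulated in plain Python; the function name and data layout are illustrative, not anything from SPARK-1485 or MLlib:

```python
def ring_allreduce(worker_data):
    """Simulate a synchronous ring all-reduce.

    worker_data: n workers, each holding a list of n chunks (one chunk
    per worker). Returns the per-worker state after the algorithm:
    every worker ends up holding the element-wise sum over all workers.
    """
    n = len(worker_data)
    data = [list(w) for w in worker_data]

    # Phase 1: scatter-reduce. At step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which accumulates it. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] += data[i][c]

    # Phase 2: all-gather. At step s, worker i forwards the completed
    # chunk (i + 1 - s) % n to its right neighbour, which overwrites its
    # copy. After n - 1 steps every worker has every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]

    return data
```

Note the MPI-style structure this implies: every worker participates in every step, which is exactly why losing one task and re-running it alone (fine-grained recovery) leaves the remaining workers blocked.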
Is there a particular new execution model you are referring to, or do we
plan to investigate a new execution model? For the MPI-like model, we
also need gang scheduling (i.e. schedule all tasks at once or none of them),
and I don't think we have support for that in the scheduler right now.
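The all-or-nothing semantics of gang scheduling can be sketched as a toy slot-based scheduler (entirely hypothetical; this is not Spark's scheduler, and the class and method names are made up for illustration):

```python
class GangScheduler:
    """Toy all-or-nothing (gang) scheduler. An MPI-style job of n tasks
    is launched only when n slots are free simultaneously; otherwise
    nothing is launched, because starting a subset of the tasks would
    leave the job hung waiting on its missing peers at a barrier.
    """

    def __init__(self, slots):
        self.free = slots
        self.running = {}  # job_id -> number of tasks holding slots

    def try_launch(self, job_id, n_tasks):
        if n_tasks > self.free:
            return False        # reject the entire gang; launch nothing
        self.free -= n_tasks    # all tasks acquire slots together
        self.running[job_id] = n_tasks
        return True

    def finish(self, job_id):
        # The whole gang completes (or is torn down) as a unit.
        self.free += self.running.pop(job_id)
```

Contrast this with fine-grained task scheduling, where a scheduler would happily start 2 of a job's 3 tasks and run the third when a slot frees up; for a barrier-synchronized job that partial start is a deadlock, not progress.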

> --
> Xiangrui Meng
> Software Engineer
> Databricks Inc.
