spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Integrating ML/DL frameworks with Spark
Date Tue, 08 May 2018 18:16:10 GMT
Yes, Nan, totally agree. To be on the same page, that's exactly what I
wrote, wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcgill@gmail.com> wrote:

> besides that, one of the things needed by multiple frameworks is
> the ability to schedule tasks in a single wave
>
> i.e.
>
> if a framework like xgboost/mxnet requires 50 parallel workers, Spark
> should be able to ensure that either we run all 50 tasks at once, or we
> quit the entire application/job after some timeout period
>
> Best,
>
> Nan
>
> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <rxin@databricks.com> wrote:
>
>> I think that's what Xiangrui was referring to. Instead of retrying a
>> single task, retry the entire stage, and the entire stage of tasks needs to
>> be scheduled all at once.
>>
>>
>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>>
>>>>
>>>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>>>    task recovery, i.e. if something fails, only that task is rerun. This
>>>>>    doesn’t match the execution model of distributed ML/DL frameworks
>>>>>    that are typically MPI-based, and rerunning a single task would lead
>>>>>    to the entire system hanging. A whole stage needs to be re-run.
>>>>>
>>>> This is not only useful for integrating with 3rd-party frameworks, but
>>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>>> Spark MLlib was to implement an All-Reduce primitive (SPARK-1485
>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>> with some compromised solutions. With the new execution model, we can set
>>>> up a hybrid cluster and do all-reduce properly.
>>>>
>>>>
>>> Is there a particular new execution model you are referring to or do we
>>> plan to investigate a new execution model? For the MPI-like model, we
>>> also need gang scheduling (i.e. schedule all tasks at once or none of them)
>>> and I don't think we have support for that in the scheduler right now.
>>>
>>>>
>>>>> --
>>>>
>>>> Xiangrui Meng
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>
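
Illustrative sketch (not part of the original messages): the two
requirements raised in this thread, scheduling all tasks of a stage in a
single wave or giving up after a timeout, and rerunning the whole stage when
any one task fails, might look roughly like the following in plain Python.
This is not Spark's scheduler API. The names GangTimeout, wait_for_slots,
run_stage_as_gang, and the free_slots callback are hypothetical, and threads
stand in for cluster executors.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class GangTimeout(Exception):
    """Raised when the required slots do not become free before the deadline."""


def wait_for_slots(free_slots, required, timeout_s, poll_s=0.01):
    """Block until free_slots() >= required, or raise GangTimeout.

    free_slots is a hypothetical callback reporting currently free workers.
    """
    deadline = time.monotonic() + timeout_s
    while free_slots() < required:
        if time.monotonic() >= deadline:
            raise GangTimeout(f"needed {required} slots, have {free_slots()}")
        time.sleep(poll_s)


def run_stage_as_gang(tasks, free_slots, timeout_s=1.0, max_attempts=3):
    """Run all tasks in one wave; if any task fails, rerun the whole stage.

    Returns (results, attempt_number). A GangTimeout propagates to the
    caller, which corresponds to quitting the job.
    """
    for attempt in range(1, max_attempts + 1):
        # All-or-nothing: do not start any task until every slot is available.
        wait_for_slots(free_slots, len(tasks), timeout_s)
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            futures = [pool.submit(task) for task in tasks]
            try:
                return [f.result() for f in futures], attempt
            except Exception:
                # One task failed: discard every result and retry the stage,
                # rather than rerunning only the failed task.
                continue
    raise RuntimeError(f"stage failed after {max_attempts} attempts")
```

For example, run_stage_as_gang([f, g], free_slots=lambda: 2) starts f and g
together, while free_slots=lambda: 1 raises GangTimeout once timeout_s has
elapsed because only one of the two required slots is ever available.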
