spark-dev mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: Integrating ML/DL frameworks with Spark
Date Tue, 08 May 2018 18:17:36 GMT
.....how I skipped the last part........

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <rxin@databricks.com> wrote:

> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> wrote, wasn't it?
>
> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcgill@gmail.com> wrote:
>
>> besides that, one of the things which is needed by multiple frameworks is
>> to schedule tasks in a single wave
>>
>> i.e.
>>
>> if frameworks like xgboost/mxnet require 50 parallel workers, Spark
>> should provide the capability to ensure that either all 50 tasks run at
>> once, or the complete application/job is aborted after some timeout
>> period
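
[Editor's note: the all-or-nothing requirement above can be sketched as a minimal, Spark-independent simulation. The names (`gang_schedule`, `free_slots_fn`, `GangScheduleTimeout`) are hypothetical illustrations, not Spark APIs.]

```python
import time


class GangScheduleTimeout(Exception):
    """Raised when the full set of workers cannot be scheduled in time."""


def gang_schedule(required_workers, free_slots_fn, timeout_s, poll_s=0.01):
    """All-or-nothing scheduling: launch `required_workers` tasks in a
    single wave, or give up entirely after `timeout_s` seconds.

    `free_slots_fn` is a hypothetical callback returning the number of
    executor slots currently free (a stand-in for real scheduler state).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if free_slots_fn() >= required_workers:
            # Launch the whole wave at once; a partial launch would leave
            # MPI-style workers blocked waiting for peers that never start.
            return ["task-%d" % i for i in range(required_workers)]
        time.sleep(poll_s)
    raise GangScheduleTimeout(
        "could not reserve %d slots within %ss" % (required_workers, timeout_s))
```

The key point is the single atomic decision: either the full wave launches, or the caller gets a timeout error and can abort the job, rather than holding a subset of slots indefinitely.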
>>
>> Best,
>>
>> Nan
>>
>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>> single task, retry the entire stage, and the entire stage of tasks need to
>>> be scheduled all at once.
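
[Editor's note: stage-level retry, as opposed to Spark's usual per-task retry, can be sketched as follows. `run_stage_with_retries` and `run_task` are hypothetical names for illustration, not Spark's scheduler API.]

```python
def run_stage_with_retries(tasks, run_task, max_attempts=3):
    """Retry at stage granularity: if any task fails, discard all partial
    results and resubmit every task in the stage together as one wave.

    `run_task(task, attempt)` is a hypothetical per-task runner that
    returns a result or raises on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # All tasks of the stage are (re)submitted together; no result
            # from a failed attempt is kept.
            return [run_task(t, attempt) for t in tasks]
        except Exception:
            if attempt == max_attempts:
                raise
```

This matches the MPI-like execution model: a lost worker invalidates the whole collective step, so the unit of recovery is the stage, not the task.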
>>>
>>>
>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>> shivaram@eecs.berkeley.edu> wrote:
>>>
>>>>
>>>>>
>>>>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>>>>    task recovery, i.e. if something fails, only that task is rerun.
>>>>>>    This doesn’t match the execution model of distributed ML/DL
>>>>>>    frameworks that are typically MPI-based, and rerunning a single
>>>>>>    task would lead to the entire system hanging. A whole stage needs
>>>>>>    to be re-run.
>>>>>>
>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>> but also useful for scaling MLlib algorithms. One of my earliest
>>>>> attempts in Spark MLlib was to implement an All-Reduce primitive
>>>>> (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>). But
>>>>> we ended up with some compromised solutions. With the new execution
>>>>> model, we can set up a hybrid cluster and do all-reduce properly.
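
[Editor's note: the semantics of the all-reduce primitive mentioned above can be sketched in a few lines. A real implementation exchanges messages (ring or tree topology) among long-running tasks; this in-memory sketch only illustrates the contract, and `all_reduce_sum` is a hypothetical name.]

```python
def all_reduce_sum(worker_values):
    """Naive all-reduce over per-worker partial vectors: every worker
    ends up with the same elementwise global sum.

    `worker_values` is a list of equal-length vectors, one per worker.
    """
    total = [sum(col) for col in zip(*worker_values)]
    # Every worker receives an identical copy of the reduced result.
    return [list(total) for _ in worker_values]
```

The contract (all workers must contribute and all receive the result) is exactly why gang scheduling matters: if even one of the participating tasks is missing, the collective never completes.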
>>>>>
>>>>>
>>>> Is there a particular new execution model you are referring to, or do we
>>>> plan to investigate a new execution model? For the MPI-like model, we
>>>> also need gang scheduling (i.e. schedule all tasks at once or none of
>>>> them), and I don't think we have support for that in the scheduler right now.
>>>>
>>>>>
>>>>>> --
>>>>>
>>>>> Xiangrui Meng
>>>>>
>>>>> Software Engineer
>>>>>
>>>>> Databricks Inc. <http://databricks.com/>
>>>>>
>>>>
>>>>
>>
