spark-dev mailing list archives

From Bryan Cutler <cutl...@gmail.com>
Subject Re: Integrating ML/DL frameworks with Spark
Date Tue, 15 May 2018 06:37:20 GMT
Thanks for starting this discussion. I'd also like to see some improvements
in this area, and I'm glad to hear that the Pandas UDFs / Arrow
functionality might be useful. I'm wondering whether your initial
investigations found anything lacking in the Arrow format, or possible
improvements that would simplify the data representation. Also, while data
could be handed off in a UDF, would it make sense to also discuss a more
formal way to externalize the data, one that would also work for the Scala
API?
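
For concreteness, a minimal sketch of the Arrow-backed handoff that the
Spark 2.3 grouped-map Pandas UDF API already provides; the schema, column
names, and the placeholder model step below are made up for illustration:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Grouped-map Pandas UDF: each group arrives as one Arrow-backed
    # pandas.DataFrame that a DL/ML library could consume directly.
    @pandas_udf("id long, prediction double", PandasUDFType.GROUPED_MAP)
    def fit_and_score(pdf):
        # A framework-specific step (e.g. converting pdf to tensors and
        # running a model) would go here; this placeholder echoes the ids
        # and returns a dummy prediction per row.
        return pd.DataFrame({"id": pdf["id"], "prediction": 0.0})

    # Usage (hypothetical column names):
    # df.groupBy("model_partition").apply(fit_and_score)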

Thanks,
Bryan

On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng <meng@databricks.com> wrote:

> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to add
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
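
For illustration, the barrier execution mode later added in Spark 2.4
roughly matches the shape proposed here; the sketch below uses that later
API and is not something Spark supported at the time of this thread (the
50-partition sizing mirrors the xgboost/mxnet example quoted further down):

    from pyspark import BarrierTaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()

    def start_worker(iterator):
        ctx = BarrierTaskContext.get()
        # The scheduler launches all tasks of a barrier stage together (or
        # fails the job); barrier() blocks until every task reaches this
        # point, giving MPI-style all-or-nothing semantics.
        ctx.barrier()
        # An MPI-style worker (e.g. xgboost or mxnet) could be started here,
        # using ctx.getTaskInfos() to discover the addresses of its peers.
        yield [info.address for info in ctx.getTaskInfos()]

    peers = (spark.sparkContext
             .parallelize(range(50), 50)  # 50 workers, scheduled as one wave
             .barrier()
             .mapPartitions(start_worker)
             .collect())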
>
>
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu <zhunanmcgill@gmail.com> wrote:
>
>> .....how I skipped the last part........
>>
>> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>>> wrote, wasn't it?
>>>
>>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcgill@gmail.com> wrote:
>>>
>>>> besides that, one of the things which is needed by multiple frameworks
>>>> is to schedule tasks in a single wave
>>>>
>>>> i.e.
>>>>
>>>> if a framework like xgboost/mxnet requires 50 parallel workers, Spark
>>>> should provide the capability to ensure that either all 50 tasks run at
>>>> once, or the whole application/job is aborted after some timeout period
>>>>
>>>> Best,
>>>>
>>>> Nan
>>>>
>>>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <rxin@databricks.com>
>>>> wrote:
>>>>
>>>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>>>> single task, retry the entire stage, and the entire stage of tasks
>>>>> needs to be scheduled all at once.
>>>>>
>>>>>
>>>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>>>> shivaram@eecs.berkeley.edu> wrote:
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>    - Fault tolerance and execution model: Spark assumes
>>>>>>>>    fine-grained task recovery, i.e. if something fails, only that
>>>>>>>>    task is rerun. This doesn’t match the execution model of
>>>>>>>>    distributed ML/DL frameworks that are typically MPI-based, and
>>>>>>>>    rerunning a single task would lead to the entire system hanging.
>>>>>>>>    A whole stage needs to be re-run.
>>>>>>>>
>>>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>>>> but also useful for scaling MLlib algorithms. One of my earliest
>>>>>>> attempts in Spark MLlib was to implement an All-Reduce primitive
>>>>>>> (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>). But
>>>>>>> we ended up with some compromised solutions. With the new execution
>>>>>>> model, we can set up a hybrid cluster and do all-reduce properly.
>>>>>>>
>>>>>>>
>>>>>> Is there a particular new execution model you are referring to, or do
>>>>>> we plan to investigate a new execution model? For the MPI-like model,
>>>>>> we also need gang scheduling (i.e. schedule all tasks at once or none
>>>>>> of them), and I don't think we have support for that in the scheduler
>>>>>> right now.
>>>>>>
>>>>>>>
>>>>>>>> --
>>>>>>>
>>>>>>> Xiangrui Meng
>>>>>>>
>>>>>>> Software Engineer
>>>>>>>
>>>>>>> Databricks Inc. <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. <http://databricks.com/>
>
