spark-dev mailing list archives

From Naveen Swamy <>
Subject Re: Integrating ML/DL frameworks with Spark
Date Tue, 08 May 2018 18:08:38 GMT
I am a committer on the MXNet project and very interested in working on
integrating with Spark.
I am wondering how training would proceed in the case of:
1) training done on one host with multiple GPUs -- I don't know whether
Spark's capabilities can be leveraged here;
2) distributed training with data parallelism -- how can we leverage
Spark's map-reduce model to fit distributed training? The model of
execution here is more iterative in nature.

Please let me know.

Thanks, Naveen

On Tue, May 8, 2018 at 8:53 AM, Shivaram Venkataraman <> wrote:

>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>    task recovery, i.e. if something fails, only that task is rerun. This
>>>    doesn’t match the execution model of distributed ML/DL frameworks that are
>>>    typically MPI-based, and rerunning a single task would lead to the entire
>>>    system hanging. A whole stage needs to be re-run.
>>> This is not only useful for integrating with 3rd-party frameworks, but
>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>> <>). But we ended up
>> with some compromised solutions. With the new execution model, we can set
>> up a hybrid cluster and do all-reduce properly.
> Is there a particular new execution model you are referring to or do we
> plan to investigate a new execution model ?  For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of them)
> and I don't think we have support for that in the scheduler right now.
>>> --
>> Xiangrui Meng
>> Software Engineer
>> Databricks Inc.
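For archive readers: the all-reduce primitive discussed in the quoted thread is the MPI-style pattern that fits poorly with Spark's fine-grained task recovery. Below is a minimal, self-contained sketch (plain Python, no Spark or MPI; worker communication is simulated with local lists, and all names are illustrative) of the classic ring all-reduce. It also illustrates the fault-tolerance mismatch: every step requires all n workers to exchange data, so losing one worker mid-ring stalls every other worker, and rerunning a single task cannot recover the job -- the whole stage must restart.

```python
def ring_all_reduce(worker_vectors):
    """Simulate ring all-reduce over n workers.

    worker_vectors: one length-n vector per worker (one chunk per worker).
    Returns each worker's final buffer; all end up equal to the
    element-wise sum of the inputs.
    """
    n = len(worker_vectors)
    assert all(len(v) == n for v in worker_vectors), "one chunk per worker"
    buf = [list(v) for v in worker_vectors]  # each worker's local buffer

    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which accumulates it. After n-1 steps,
    # worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:  # apply all sends "simultaneously"
            buf[(i + 1) % n][c] += val

    # Phase 2: all-gather. Each worker forwards its completed chunk around
    # the ring so that every worker ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, buf[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, c, val in sends:
            buf[(i + 1) % n][c] = val

    return buf


if __name__ == "__main__":
    # Three workers, each holding a local gradient vector.
    result = ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(result)  # every worker holds [12, 15, 18]
```

Note the `sends` snapshot in each step: all workers exchange data in lock-step, which is exactly the gang-scheduling requirement mentioned above -- every participant must be running at once for any step to complete.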
