Hi Riccardo,

Yes, you can run TensorFlow distributed training (and inference) inline with PySpark; see some examples at https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/tensorflow/tfpark/estimator_dataset.py (using TF Keras API), https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/tensorflow/tfpark/estimator_dataset.py (using TF Estimator API) and https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/examples/tensorflow/distributed_training.
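To make the pattern concrete, here is a toy, library-free sketch of the synchronous data-parallel step that this style of distributed training builds on. To be clear, this is not the TFPark/BigDL API; the partition layout, the 1-D linear model, and the averaging step are all illustrative assumptions, just to show the map-then-reduce shape of one training iteration.

```python
# Toy synchronous data-parallel training step, in the spirit of
# BigDL/TFPark-style distributed training. No Spark or TensorFlow here:
# "partitions" are plain lists and the model is 1-D linear regression.

def local_gradient(weight, partition):
    """Mean gradient of 0.5*(w*x - y)^2 over one data partition."""
    g = sum((weight * x - y) * x for x, y in partition)
    return g / len(partition)

def train_step(weight, partitions, lr=0.1):
    """One synchronous step: each 'worker' computes a local gradient,
    then the driver averages them and updates the shared weight."""
    grads = [local_gradient(weight, p) for p in partitions]  # map on workers
    avg = sum(grads) / len(grads)                            # reduce on driver
    return weight - lr * avg

# Data generated from y = 3*x, split across two "worker partitions".
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

w = 0.0
for _ in range(200):
    w = train_step(w, partitions)
print(round(w, 3))  # converges toward the true slope, 3.0
```

In the real frameworks, the "map" runs as Spark tasks on executors holding in-memory partitions, and the gradient aggregation is done with an efficient all-reduce rather than on the driver, but the iteration structure is the same.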

For Keras API support in Analytics Zoo, it's a new implementation of Keras 1.2.2 on Spark (using BigDL).


On Mon, May 6, 2019 at 5:37 AM Riccardo Ferrari <ferrarir@gmail.com> wrote:
Thanks everyone, I really appreciate your contributions here.

@Jason, thanks for the references I'll take a look. Quickly checking github: https://github.com/intel-analytics/analytics-zoo#distributed-tensorflow-and-keras-on-sparkbigdl
Do I understand correctly I can:
  • Prepare my data with Spark
  • Define a Tensorflow model
  • Train it in distributed fashion
When using the Keras API, is it the real Keras with just an adapter layer, or is it a completely different API that mimics Keras?

@Gurav, I agree that "you should pick the right tool for the job".

The purpose of this discussion is to understand/explore whether we really need another stack, or whether we can leverage the existing infrastructure and expertise to accomplish the task.
We currently have some ML jobs, and Spark has proved to be the perfect fit for us. We know it well enough to be confident we can deliver what is asked; it scales, it is resilient, it works.

We are starting to evaluate/introduce some DL models, and being able to leverage the existing infra would be a big plus. It is not only having to deal with a new set of machines running a different stack (i.e. TensorFlow, MXNet, ...); it is everything around it: tuning, managing, packaging applications, testing and so on. Are these reasonable concerns?


On Sun, May 5, 2019 at 8:06 PM Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
If someone is trying to actually use deep learning algorithms, their focus should be on choosing the technology stack which gives them maximum flexibility to try the nuances of their algorithms.

From a personal perspective, I always prefer to use libraries which provide the best flexibility and extensibility in terms of the science/mathematics of the subject. For example, open a book on Linear Regression and see whether all the mathematical formulations are available in the Spark regression module or not.

It is always better to choose a technology that fits into the nuances and perfection of the science, rather than choose a technology and then try to fit the science into it.


On Sun, May 5, 2019 at 2:23 PM Jason Dai <jason.dai@gmail.com> wrote:
You may find talks from Analytics Zoo users at https://analytics-zoo.github.io/master/#presentations/; in particular, some recent user examples on Analytics Zoo:

On Sun, May 5, 2019 at 6:29 AM Riccardo Ferrari <ferrarir@gmail.com> wrote:
Thank you for your answers!

While it is clear that each DL framework can handle distributed model training on its own (some better than others), I still see a lot of value in having Spark on the ETL/pre-processing side, hence my question.
I am trying to avoid managing multiple stacks/workflows and hoping to unify my system. Projects like TensorFlowOnSpark or Analytics-Zoo (to name a couple) feel like they could help; still, I really appreciate your comments and anyone who could add some value to this discussion. Does anyone have experience with them?


On Sat, May 4, 2019 at 8:01 PM Pat Ferrel <pat@occamsmachete.com> wrote:

Spark does not do the DL learning part of the pipeline (afaik) so it is limited to data ingestion and transforms (ETL). It therefore is optional and other ETL options might be better for you. 
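To illustrate the decoupling Pat describes, here is a minimal, stdlib-only sketch of an ETL stage (the part Spark would do, as a map/filter over records) producing clean numeric features for a separate training stage. The column names and record layout are made up for illustration; real pipelines would of course use DataFrames and a proper handoff format.

```python
# Sketch of ETL decoupled from training: parse raw records, drop bad
# rows, cast types. The downstream trainer only ever sees the ETL output.
import csv
import io

RAW = """user,age,clicks
alice,34,12
bob,,7
carol,29,not_a_number
"""

def etl(raw_text):
    """Parse CSV, skip malformed rows, and cast to floats,
    mimicking a Spark map + filter over a text source."""
    out = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            out.append((float(row["age"]), float(row["clicks"])))
        except ValueError:
            continue  # drop rows with missing or non-numeric fields
    return out

features = etl(RAW)
print(features)  # only the fully clean row survives
```

Because the training stage depends only on `features` (or on files written in this shape), the ETL engine is swappable, which is exactly why Spark here is optional rather than structural.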

Most of the technologies @Gourav mentions have their own scaling based on their own compute engines specialized for their DL implementations, so be aware that Spark scaling has nothing to do with scaling most of the DL engines; they have their own solutions.

From: Gourav Sengupta <gourav.sengupta@gmail.com>
Reply: Gourav Sengupta <gourav.sengupta@gmail.com>
Date: May 4, 2019 at 10:24:29 AM
To: Riccardo Ferrari <ferrarir@gmail.com>
Cc: User <user@spark.apache.org>
Subject:  Re: Deep Learning with Spark, what is your experience?

On Sat, May 4, 2019 at 10:59 AM Riccardo Ferrari <ferrarir@gmail.com> wrote:
Hi list,

I am trying to understand if it makes sense to leverage Spark as an enabling platform for Deep Learning.

My open questions to you are:
  • Do you use Apache Spark in your DL pipelines?
  • How do you use Spark for DL? Is it just a stand-alone stage in the workflow (i.e. a data preparation script) or is it more integrated?
I see a major advantage in leveraging Spark as a unified entry point; for example, you can easily abstract data sources and leverage existing team skills for data pre-processing and training. On the flip side you may hit some limitations, including supported versions and so on.
What is your experience?