All of these tools are reasonable choices. I don't think the Spark project itself has a view on which works best. They also solve different problems. For example, Petastorm is not a training framework, but a way to feed data to a distributed DL training process on Spark. For what it's worth, Databricks ships Horovod and Petastorm, but that doesn't mean the other projects are second-class.

On Tue, Jun 1, 2021 at 4:59 PM Gourav Sengupta <gourav.sengupta.developer@gmail.com> wrote:
Dear TD, Matei, Michael, Reynold,

I hope all of you and your loved ones are staying safe and doing well.

As a member of the community, I am finding the direction from the Spark mentors a bit confusing, and I was wondering if I could seek your help.

We have to make long-term decisions that are aligned with open source Spark's compatibility and direction, and it would be wonderful to know which is the most dependable route to get data from Spark to TensorFlow. Is it:
1. petastorm
2. horovod
3. tensorflowonspark
4. spark_tensorflow_distributor
or something else?


Any comments from you will be super useful.

If I am not wrong, seamless integration between Spark and TensorFlow/PyTorch was one of the most exciting visions of Spark 3.x.

While Spark ML has its own place, I think that TensorFlow and PyTorch will see a lot of focused development as well.


Regards,
Gourav Sengupta