spark-issues mailing list archives

From "Jepson (JIRA)" <>
Subject [jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
Date Fri, 22 Jun 2018 03:37:00 GMT


Jepson commented on SPARK-24579:

Very nice.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> ----------------------------------------------------------------------------
>                 Key: SPARK-24579
>                 URL:
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen
>         Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange between Apache Spark and DL/AI Frameworks.pdf
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache MXNet (incubating).
> Both big data and AI are indispensable components to drive business innovation, and there
> have been multiple attempts from both communities to bring them together.
> We saw efforts from the AI community to implement data solutions for AI frameworks, like
> tf.data and tf.Transform. However, with 50+ data sources and built-in SQL, DataFrames, and
> Streaming features, Spark remains the community choice for big data. This is why we saw many
> efforts to integrate DL/AI frameworks with Spark to leverage its power, for example, the
> TFRecords data source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of these integrations are possible without exchanging data between Spark and external
> DL/AI frameworks, and performance matters. However, there is no standard way to exchange
> data, so implementations and performance optimizations remain fragmented. For example,
> TensorFlowOnSpark uses Hadoop InputFormat/OutputFormat for TensorFlow's TFRecords to load
> and save data and passes the RDD records to TensorFlow in Python, while TensorFrames converts
> Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's Java API. How can we
> reduce this complexity?
> The proposal here is to standardize the data exchange interface (or format) between Spark
> and DL/AI frameworks and to optimize data conversion from/to this interface. DL/AI frameworks
> can then leverage Spark to load data from virtually anywhere without extra effort building
> complex data solutions, such as reading features from a production data warehouse or streaming
> model inference. Spark users can adopt DL/AI frameworks without learning each framework's
> data APIs, and developers on both sides can work on performance optimizations independently,
> provided the interface itself does not introduce significant overhead.
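
As a rough illustration of the kind of hand-off the SPIP wants to standardize, here is a
minimal sketch that moves a Spark DataFrame into TensorFlow through pandas using Arrow-backed
conversion. It assumes PySpark 2.3+ with pyarrow installed and TensorFlow 2.x on the driver;
the config key and the tf.data hand-off are illustrative only, not the interface proposed in
the SPIP.

    # Sketch only: Arrow-backed Spark -> pandas -> tf.data hand-off.
    # Assumes pyspark 2.3+/2.4 with pyarrow and TensorFlow 2.x installed; the Arrow
    # config key was later renamed to spark.sql.execution.arrow.pyspark.enabled in 3.x.
    from pyspark.sql import SparkSession
    import tensorflow as tf

    spark = (SparkSession.builder
             .appName("arrow-exchange-sketch")
             .config("spark.sql.execution.arrow.enabled", "true")  # Arrow-based toPandas()
             .getOrCreate())

    # A toy feature table standing in for features read from a production warehouse.
    df = spark.range(0, 1000).selectExpr("id", "rand() AS feature")

    # Collect to the driver via Arrow; fine for a sketch, not for large datasets.
    pdf = df.toPandas()

    # Hand the columns to TensorFlow as a tf.data input pipeline.
    dataset = (tf.data.Dataset
               .from_tensor_slices((pdf["feature"].values, pdf["id"].values))
               .batch(32))

    for features, ids in dataset.take(1):
        print(features.shape, ids.shape)

Note that this sketch funnels all data through the driver and through pandas; a standardized,
optimized exchange interface is aimed at removing exactly this kind of ad-hoc conversion.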
