spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jepson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
Date Fri, 22 Jun 2018 03:37:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519956#comment-16519956
] 

Jepson commented on SPARK-24579:
--------------------------------

Very nice.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-24579
>                 URL: https://issues.apache.org/jira/browse/SPARK-24579
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen
>         Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange between
Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache MXNet (incubating).
> Both big data and AI are indispensable components to drive business innovation and there
have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI frameworks like tf.data
and tf.Transform. However, with 50+ data sources and built-in SQL, DataFrames, and Streaming
features, Spark remains the community choice for big data. This is why we saw many efforts
to integrate DL/AI frameworks with Spark to leverage its power, for example, TFRecords data
source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project Hydrogen, this
SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark and external
DL/AI frameworks. And the performance matters. However, there doesn’t exist a standard way
to exchange data and hence implementation and performance optimization fall into pieces. For
example, TensorFlowOnSpark uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords
to load and save data and pass the RDD records to TensorFlow in Python. And TensorFrames converts
Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s Java API. How can we
reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) between Spark
and DL/AI frameworks and optimize data conversion from/to this interface.  So DL/AI frameworks
can leverage Spark to load data virtually from anywhere without spending extra effort building
complex data solutions, like reading features from a production data warehouse or streaming
model inference. Spark users can use DL/AI frameworks without learning specific data APIs
implemented there. And developers from both sides can work on performance optimizations independently
given the interface itself doesn’t introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message