spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2688) Need a way to run multiple data pipeline concurrently
Date Sun, 25 Jan 2015 15:31:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291144#comment-14291144 ]

Sean Owen commented on SPARK-2688:
----------------------------------

I am still not clear on what you are trying to do that isn't possible with Spark now, using
persistence. Can anyone who disagrees give a concrete example? The example in this JIRA does
not appear to be such a thing. Many stages can already depend on one stage, and that does not
imply that anything hits disk or is recomputed.
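
To make the point about persistence concrete, here is a toy model in plain Python. It is not Spark and does not use the Spark API; the class LazyDataset and the helpers build_graph and run are hypothetical names invented for this sketch. It only mimics the behaviour discussed in this thread: a lazy dataset is recomputed for every action unless it has been marked as persisted. In Spark itself the corresponding step would be calling persist() or cache() on rdd2 before running the actions on rdd3 to rdd6.

```python
class LazyDataset:
    """Toy stand-in for an RDD: lazy, and recomputed per action unless persisted."""

    def __init__(self, compute, name):
        self._compute = compute    # zero-argument function that produces a list
        self.name = name
        self._persisted = False
        self._cached = None
        self.evaluations = 0       # how many times this dataset was actually computed

    def map(self, fn, name):
        # Transformation: builds a new lazy dataset, computes nothing yet.
        parent = self
        return LazyDataset(lambda: [fn(x) for x in parent._materialize()], name)

    def persist(self):
        # Mark the dataset so its first computed result is kept and reused.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._persisted:
            self._cached = data
        return data

    def foreach(self, fn):
        # Action: forces evaluation of this dataset and everything it depends on.
        for x in self._materialize():
            fn(x)


def build_graph(persist_rdd2):
    # rdd1 -> rdd2 -> rdd3 .. rdd6, the shape described in the issue.
    rdd1 = LazyDataset(lambda: list(range(5)), "rdd1")
    rdd2 = rdd1.map(lambda x: x * 2, "rdd2")
    if persist_rdd2:
        rdd2.persist()
    children = [rdd2.map(lambda x, k=k: x + k, f"rdd{k}") for k in range(3, 7)]
    return rdd2, children


def run(persist_rdd2):
    rdd2, children = build_graph(persist_rdd2)
    for child in children:
        child.foreach(lambda _: None)   # dummy action, as in the issue description
    return rdd2.evaluations


if __name__ == "__main__":
    print("rdd2 evaluations without persist:", run(False))
    print("rdd2 evaluations with persist:   ", run(True))
```

In this model, rdd2 is evaluated four times when it is not persisted (once per downstream action) and once when it is. The model keeps the result in memory only and says nothing about how Spark schedules stages or whether the four actions run concurrently.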

> Need a way to run multiple data pipeline concurrently
> -----------------------------------------------------
>
>                 Key: SPARK-2688
>                 URL: https://issues.apache.org/jira/browse/SPARK-2688
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an intermediate
> rdd2. We use foreach(fn) with a dummy function to trigger the execution. However, rdd.foreach(fn)
> only triggers the pipeline rdd1 -> rdd2 -> rdd3. To make things worse, when we call rdd4.foreach(),
> rdd2 will be recomputed. This is very inefficient. Ideally, we should be able to trigger the
> execution of the whole graph and reuse rdd2, but there doesn't seem to be a way of doing so. Tez
> already realized the importance of this (TEZ-391), so I think Spark should provide this too.
> This is required for Hive to support multi-insert queries (HIVE-7292).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
