spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
Date Sun, 25 Jan 2015 15:51:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291152#comment-14291152 ]

Sean Owen commented on SPARK-3621:
----------------------------------

Hm, can you give an example? I think you mean "collect an RDD directly to every executor in its
entirety." That's not an operation that exists today, but it makes some sense.

However, my first question is: is this really something you need RDDs to do? You can already
side-load whatever you want on executors without involving the driver.
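
One way to read "side load" is that each task loads the lookup data itself, from storage every executor can reach, inside a per-partition function passed to `mapPartitions`. The following is a minimal sketch under that assumption, written in PySpark style. The tab-separated file layout and the helper names `load_lookup`, `join_with_side_file`, and `side_load_join` are hypothetical and do not come from this thread.

```python
import csv


def load_lookup(path):
    """Read a two-column, tab-separated file into a dict (hypothetical layout)."""
    with open(path, newline="") as f:
        return {key: value for key, value in csv.reader(f, delimiter="\t")}


def join_with_side_file(rows, path):
    """Per-partition function: load the table on the executor, then probe it."""
    table = load_lookup(path)  # runs inside the task, once per partition
    for key, value in rows:
        if key in table:
            yield key, (value, table[key])


def side_load_join(large_rdd, path):
    """Assumes `path` is readable from every executor (e.g. a shared mount)."""
    return large_rdd.mapPartitions(lambda rows: join_with_side_file(rows, path))
```

In this sketch the driver only ships the path string, never the data. The cost is that every partition re-reads the file, so it suits small lookup tables.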

The original description, however, talks about sharing data between stages. Is this not just
a matter of persisting an RDD? That also does not involve the driver.

> Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3621
>                 URL: https://issues.apache.org/jira/browse/SPARK-3621
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing map-side join, it would be beneficial to allow
> the client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting
> a variable made of RDDs requires that all RDD data be collected to the driver and that the variable
> be shipped to the cluster after being made. It would perform better if the driver just broadcast
> the RDDs and used the corresponding data in jobs (such as building hashmaps at executors).
> Tez has a broadcast edge which can ship data from the previous stage to the next stage, which
> doesn't require driver-side processing.
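
The driver-side round trip that the description refers to can be sketched as follows, in PySpark style. This is a minimal sketch and not code from the issue: the function names `join_partition` and `broadcast_join` are hypothetical, while `collectAsMap`, `SparkContext.broadcast`, and `mapPartitions` are existing Spark APIs. It assumes the small RDD is a pair RDD with unique keys that fits in memory on the driver and on each executor.

```python
def join_partition(rows, small_table):
    """Map-side hash join for one partition: probe small_table with each key."""
    for key, value in rows:
        if key in small_table:
            yield key, (value, small_table[key])


def broadcast_join(sc, large_rdd, small_rdd):
    """Join large_rdd against small_rdd without shuffling large_rdd.

    `sc` is expected to be a SparkContext; both RDDs hold (key, value) pairs.
    """
    # Step 1: all of small_rdd is collected to the driver as a dict.
    small_table = small_rdd.collectAsMap()
    # Step 2: the driver ships that dict back out to the executors.
    small_bc = sc.broadcast(small_table)
    # Step 3: each task probes the broadcast hash table locally.
    return large_rdd.mapPartitions(
        lambda rows: join_partition(rows, small_bc.value)
    )
```

Steps 1 and 2 are the trip through the driver that the issue proposes to avoid; step 3 is the map-side join itself and would stay the same if the RDD could be broadcast directly.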



