spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: small (yet major) change going in: broadcasting RDD to reduce task size
Date Thu, 17 Jul 2014 05:12:25 GMT
Hey Reynold, just to clarify, users will still have to manually broadcast objects that they
want to use *across* operations (e.g. in multiple iterations of an algorithm, or multiple
map functions, or stuff like that). But they won't have to broadcast something they only use
once.
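To illustrate the distinction above with a rough, Spark-free sketch: if a large variable is reused by several operations, a manual broadcast ships it once, whereas a per-operation (automatic) broadcast would ship it once per operation. The names here ("broadcast_store", "bc_lookup") are made up for illustration and are not real Spark API or internals.

```python
import pickle

# Large shared variable, reused across several operations.
lookup = {i: i * i for i in range(10_000)}

# Without a manual broadcast, each operation's closure captures the table,
# so an automatic per-operation broadcast ships it once per operation.
num_operations = 3
shipped_auto = num_operations * len(pickle.dumps(lookup))

# With a manual broadcast, the table is shipped once; each operation's
# closure captures only a small handle ("bc_lookup" is a made-up id).
broadcast_store = {"bc_lookup": pickle.dumps(lookup)}
handle = pickle.dumps("bc_lookup")
shipped_manual = len(broadcast_store["bc_lookup"]) + num_operations * len(handle)

print(shipped_auto > shipped_manual)  # → True: manual broadcast wins on reuse
```

The gap grows with the number of operations that reuse the variable, which is why manual broadcast still matters even after the change.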

Matei

On Jul 16, 2014, at 10:07 PM, Reynold Xin <rxin@databricks.com> wrote:

> Oops - the pull request should be https://github.com/apache/spark/pull/1452
> 
> 
> On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin <rxin@databricks.com> wrote:
> 
>> Hi Spark devs,
>> 
>> Want to give you guys a heads up that I'm working on a small (but major)
>> change with respect to how task dispatching works. Currently (as of Spark
>> 1.0.1), Spark sends RDD object and closures using Akka along with the task
>> itself to the executors. This is however inefficient because all tasks in
>> the same stage use the same RDDs and closures, but we have to send these
>> closures and RDDs multiple times to the executors. This is especially bad
>> when some closure references some variable that is very large. The current
>> design led to users having to explicitly broadcast large variables.
>> 
>> The patch uses broadcast to send the RDD objects and closures to the
>> executors, and uses Akka to send only a reference to the broadcast
>> RDD/closure along with the partition-specific information for each task.
>> For those of you who know more about the internals, Spark already relies
>> on broadcast to send the Hadoop JobConf every time it reads Hadoop input,
>> because the JobConf is large.
>> 
>> The user-facing impacts of the change include:
>> 
>> 1. Users won't need to decide what to broadcast anymore.
>> 2. Task size will get smaller, resulting in faster scheduling and higher
>> task dispatch throughput.
>> 
>> In addition, the change will simplify some of Spark's internals, removing
>> the need to maintain task caches and the complex logic for broadcasting
>> the JobConf (which also led to a deadlock recently).
>> 
>> 
>> Pull request attached: https://github.com/apache/spark/pull/1450
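The dispatch change Reynold describes can be sketched without Spark at all. In this simulation, "bc_0" and "broadcast_store" are illustrative names, not Spark internals: the point is only that the RDD/closure payload is paid for once under the new scheme instead of once per task.

```python
import pickle

# Stand-in for the serialized RDD object + closure graph of a stage.
closure_and_rdd = list(range(50_000))
num_tasks = 8

# Old scheme: every task message carries the full RDD/closure payload.
old_task_msgs = [pickle.dumps((closure_and_rdd, p)) for p in range(num_tasks)]

# New scheme: the payload goes out once as a broadcast; each task message
# carries only the broadcast id plus partition-specific info.
broadcast_store = {"bc_0": pickle.dumps(closure_and_rdd)}
new_task_msgs = [pickle.dumps(("bc_0", p)) for p in range(num_tasks)]

old_total = sum(len(m) for m in old_task_msgs)
new_total = len(broadcast_store["bc_0"]) + sum(len(m) for m in new_task_msgs)
print(old_total > new_total)  # → True: payload sent once, not once per task
```

The per-task messages also become tiny and uniform in size, which is what makes scheduling faster and dispatch throughput higher, as noted in the impact list above.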
>> 

