spark-user mailing list archives

From Gylfi <gy...@berkeley.edu>
Subject Re: Passing Broadcast variable as parameter
Date Sat, 18 Jul 2015 09:03:44 GMT
Hi.

You can use a broadcast variable to make data available to all the nodes in
your cluster. The data can live longer than just the current distributed task.

For example, if you need to access a large structure in multiple sub-tasks,
instead of sending that structure again and again with each sub-task you can
send it only once and access the data inside the operation (map, flatMap,
etc.) by reading .value on the broadcast variable.
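
A minimal PySpark sketch of that pattern, assuming a small dict stands in for
the large structure. The names tag, run_broadcast_example and country_by_code
are made up for illustration, and the local[2] master is only there so the
sketch can run on one machine:

```python
def tag(record, country_by_code):
    """Attach a country name to one (user, code) record."""
    user, code = record
    return (user, country_by_code.get(code, "unknown"))


def run_broadcast_example():
    # Imported here so tag() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-sketch")
    try:
        # Hypothetical stand-in for a large read-only lookup structure.
        country_by_code = {"IS": "Iceland", "US": "United States"}
        lookup = sc.broadcast(country_by_code)  # sent to the workers once

        records = sc.parallelize([("anna", "IS"), ("bob", "US"), ("carl", "DE")])
        # Inside the operation, the data is read through lookup.value.
        return records.map(lambda r: tag(r, lookup.value)).collect()
    finally:
        sc.stop()
```

Calling run_broadcast_example() needs a working PySpark install. The closure
only captures the small broadcast handle, not the dict itself.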

See:
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Note, however, that you should treat the broadcast variable as a read-only
structure, as it is not synced between workers after it has been broadcast.

To broadcast, your data must be serializable.

If the data you are trying to broadcast is a distributed RDD (and thus, I
assume, large), perhaps what you need is some form of join operation (or
cogroup)?
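
As a rough sketch of the join alternative, assuming both data sets are
key-value RDDs: the sample data and the names local_join and run_join_example
are invented for illustration. local_join is a plain-Python mirror of an inner
join on the key, included only so the expected result shape (key, (left,
right)) can be checked without a cluster:

```python
def local_join(left, right):
    """Plain-Python inner join on key, yielding (key, (left_val, right_val))."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]


def run_join_example():
    # Imported here so local_join() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "join-sketch")
    try:
        # Hypothetical data: both sides stay distributed, nothing is broadcast.
        visits = sc.parallelize([("IS", "anna"), ("US", "bob"), ("DE", "carl")])
        countries = sc.parallelize([("IS", "Iceland"), ("US", "United States")])
        # Keys present on only one side ("DE") are dropped by an inner join.
        return sorted(visits.join(countries).collect())
    finally:
        sc.stop()
```

If unmatched keys must be kept, leftOuterJoin or cogroup on the same two RDDs
would be the place to look.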

Regards, 
    Gylfi. 


