spark-user mailing list archives

From Yong Zhang <java8...@hotmail.com>
Subject RE: Possible to broadcast a function?
Date Thu, 30 Jun 2016 12:45:32 GMT
How about this old discussion, which is related to a similar problem as yours:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html
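
One common way to do that kind of once-per-executor setup is a small warm-up job that touches a
lazily initialized singleton, roughly like the sketch below (BigStructure and loadBigStructure are
placeholder names, sc is your SparkContext, and Spark gives no hard guarantee that the dummy
partitions land on every executor):

  // Singleton: each executor JVM initializes it at most once.
  object BigStructure {
    lazy val data: Map[String, Array[Byte]] = loadBigStructure()

    private def loadBigStructure(): Map[String, Array[Byte]] = {
      // placeholder for the real (expensive) load from the external source
      Map.empty
    }
  }

  // Dummy job with roughly one partition per core slot, so every executor
  // pays the load cost up front instead of during the first real task.
  val slots = sc.defaultParallelism
  sc.parallelize(0 until slots, slots).foreachPartition { _ =>
    BigStructure.data  // forces the lazy initialization on this executor
  }
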
Yong

From: aperrin@timerazor.com
Date: Wed, 29 Jun 2016 14:00:07 +0000
Subject: Possible to broadcast a function?
To: user@spark.apache.org

The user guide describes a broadcast as a way to move a large dataset to each node:

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine
rather
than shipping a copy of it with tasks. They can be used, for example, to give every node a
copy of a
large input dataset in an efficient manner."

And the broadcast example shows it being used with a variable.
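
For reference, that basic usage looks roughly like this (toy data and a standalone-style setup,
just as a sketch):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

  // Read-only lookup data built on the driver...
  val lookup: Map[Int, String] = Map(1 -> "one", 2 -> "two")

  // ...shipped to each executor once, instead of with every task's closure.
  val bcLookup = sc.broadcast(lookup)

  val resolved = sc.parallelize(1 to 10).map(i => bcLookup.value.getOrElse(i, "unknown"))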

But is it somehow possible to instead broadcast a function that can be executed once per node?

My use case is the following:

I have a large data structure that I currently create on each executor.  The way that I create
it is a hack: when the RDD function is executed on the executor, I block, load a bunch of data
(~250 GiB) from an external data source, create the data structure as a static object in the
JVM, and then resume execution.  This works, but it ends up costing me a lot of extra memory
(e.g. a few TiB when I have a lot of executors).
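
Concretely, the hack is shaped something like this (CachedStructure, loadFromExternalSource and
rdd are placeholder names):

  // Static holder: the first task to touch it on an executor blocks and loads,
  // later tasks in the same JVM reuse the already-loaded structure.
  object CachedStructure {
    private var loaded: Map[String, Array[Byte]] = null

    def get(): Map[String, Array[Byte]] = synchronized {
      if (loaded == null) {
        loaded = loadFromExternalSource()  // the ~250 GiB load blocks here
      }
      loaded
    }

    private def loadFromExternalSource(): Map[String, Array[Byte]] = Map.empty
  }

  val processed = rdd.map { record =>
    val structure = CachedStructure.get()  // loaded once per executor JVM
    // ... use structure to process the record ...
    record
  }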

What I'd like to do is use the broadcast mechanism to load the data structure once per node,
but I can't serialize the data structure from the driver.

Any ideas?

Thanks!

Aaron
