Following this suggestion, Aaron, you may want to take a look at Alluxio as an off-heap, in-memory data store for Spark job input/output, if that works for you.

See this introduction on how to run Spark with Alluxio as data input/output:

http://www.alluxio.org/documentation/en/Running-Spark-on-Alluxio.html
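
A minimal sketch of the pattern, assuming an Alluxio master on localhost:19998; the file paths are just illustrative:

    // Read job input from Alluxio and write output back via alluxio:// URIs.
    val input = sc.textFile("alluxio://localhost:19998/data/input.txt")
    val counts = input.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("alluxio://localhost:19998/data/output")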

- Bin

On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
Have you looked at Alluxio? (formerly Tachyon)


On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <aperrin@timerazor.com> wrote:
The user guide describes a broadcast as a way to move a large dataset to each node:

"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."

And the broadcast example shows it being used with a variable.
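
For reference, the pattern from that example looks roughly like this (sketch; the RDD contents and the lookup table are just placeholders):

    val rdd = sc.parallelize(Seq("a", "b", "c"))

    // Broadcast a read-only value once per executor rather than with every task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val result = rdd.map(x => lookup.value.getOrElse(x, 0)).collect()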

But is it somehow possible to instead broadcast a function that can be executed once per node?

My use case is the following:

I have a large data structure that I currently create on each executor.  The way I create it is a hack: when the RDD function executes on the executor, I block, load a bunch of data (~250 GiB) from an external data source, build the data structure as a static object in the JVM, and then resume execution.  This works, but it costs me a lot of extra memory (e.g. a few TiB when I have many executors).
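
Roughly, the hack is a per-JVM lazy singleton, something like this simplified sketch (loadFromExternalSource and process stand in for the real loader and the real work):

    // Per-JVM singleton: the first task on each executor triggers the
    // blocking ~250 GiB load; later tasks on that JVM reuse the cached object.
    object BigData {
      // Placeholder for the real external-source load.
      private def loadFromExternalSource(): Map[String, String] = ???
      lazy val structure: Map[String, String] = loadFromExternalSource()
    }

    rdd.map { x =>
      process(x, BigData.structure) // process is a placeholder too
    }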

What I'd like to do is use the broadcast mechanism to load the data structure once per node.  But I can't serialize the data structure from the driver.
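
In other words, what I'd like is something like the following sketch, where only a small, serializable loader closure ships from the driver and each node evaluates it once (LocalCache here is imaginary; loadFromExternalSource and process are placeholders as above):

    // Hypothetical: broadcast a small loader closure instead of the data itself.
    val loader = sc.broadcast(() => loadFromExternalSource())

    rdd.mapPartitions { iter =>
      // Imaginary once-per-JVM memoization of the broadcast loader's result.
      val data = LocalCache.getOrLoad(loader.value)
      iter.map(x => process(x, data))
    }

    // One way LocalCache could work: a lazily-filled per-JVM slot.
    object LocalCache {
      @volatile private var cached: AnyRef = null
      def getOrLoad[T <: AnyRef](load: () => T): T = synchronized {
        if (cached == null) cached = load()
        cached.asInstanceOf[T]
      }
    }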

Any ideas?

Thanks!

Aaron