spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonal Goyal <>
Subject Re: Possible to broadcast a function?
Date Wed, 29 Jun 2016 15:40:25 GMT
Have you looked at Alluxio? (earlier tachyon)

Best Regards,
Founder, Nube Technologies <>
Reifier at Strata Hadoop World <>
Reifier at Spark Summit 2015


On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <> wrote:

> The user guide describes a broadcast as a way to move a large dataset to
> each node:
> "Broadcast variables allow the programmer to keep a read-only variable
> cached on each machine rather than shipping a copy of it with tasks. They
> can be used, for example, to give every node a copy of a large input
> dataset in an efficient manner."
> And the broadcast example shows it being used with a variable.
> But, is it somehow possible to instead broadcast a function that can be
> executed once, per node?
> My use case is the following:
> I have a large data structure that I currently create on each executor.
> The way that I create it is a hack.  That is, when the RDD function is
> executed on the executor, I block, load a bunch of data (~250 GiB) from an
> external data source, create the data structure as a static object in the
> JVM, and then resume execution.  This works, but it ends up costing me a
> lot of extra memory (i.e. a few TiB when I have a lot of executors).
> What I'd like to do is use the broadcast mechanism to load the data
> structure once, per node.  But, I can't serialize the data structure from
> the driver.
> Any ideas?
> Thanks!
> Aaron

View raw message