spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Perrin <aper...@timerazor.com>
Subject Re: Possible to broadcast a function?
Date Thu, 30 Jun 2016 14:29:29 GMT
That's helpful, thanks. I didn't see that thread earlier. But, it sounds like the best solution
is to use singletons in the executors, which I'm already doing.  (BTW - the reason why I consider
that method kind of hack-ish, is because the it makes the code a bit more difficult for others
to understand.). 

Based on its description, I was hoping that Spark's broadcast mechanism was using shared memory
between JVMs (memory mapped files or named pipes, etc), in which case the data structure would
only need to be created once per machine.  I'll have to take a look at the code.

Most likely, I'll have to implement a service on the node and have each executor call it.

Sent from my iPhone

> On Jun 30, 2016, at 8:45 AM, Yong Zhang <java8964@hotmail.com> wrote:
> 
> How about this old discussion related to similar problem as yours.
> 
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html
> 
> Yong
> 
> From: aperrin@timerazor.com
> Date: Wed, 29 Jun 2016 14:00:07 +0000
> Subject: Possible to broadcast a function?
> To: user@spark.apache.org
> 
> The user guide describes a broadcast as a way to move a large dataset to each node:
> 
> "Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used, for example, to give
every node a copy of a large input dataset in an efficient manner."
> 
> And the broadcast example shows it being used with a variable.
> 
> But, is it somehow possible to instead broadcast a function that can be executed once,
per node?
> 
> My use case is the following:
> 
> I have a large data structure that I currently create on each executor.  The way that
I create it is a hack.  That is, when the RDD function is executed on the executor, I block,
load a bunch of data (~250 GiB) from an external data source, create the data structure as
a static object in the JVM, and then resume execution.  This works, but it ends up costing
me a lot of extra memory (i.e. a few TiB when I have a lot of executors).
> 
> What I'd like to do is use the broadcast mechanism to load the data structure once, per
node.  But, I can't serialize the data structure from the driver.
> 
> Any ideas?
> 
> Thanks!
> 
> Aaron
> 

Mime
View raw message