spark-user mailing list archives

From Aaron Perrin <aper...@timerazor.com>
Subject Re: Possible to broadcast a function?
Date Wed, 29 Jun 2016 16:40:32 GMT
From what I've read, people have seen performance issues when a single
JVM uses more than about 60 GiB of memory.  I haven't tested it myself,
but I take it that's not actually the case?

Also, how does one size executor memory when the driver takes up some of
it on one node?

For example, say my cluster has N nodes, each with 500 GiB of memory.
And say the memory available to executors is roughly 80% of that, or
~400 GiB per node.  So, you're suggesting I should allocate ~400 GiB to
each executor?  How does that affect the node that's hosting the driver,
when the driver itself uses ~10-15 GiB?  Or do I have to decrease
executor memory to ~385 GiB across all executors?

(Note: I'm running on YARN, which may affect this.)
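Concretely, I'm imagining a submission along these lines -- the numbers
are purely illustrative, and on YARN the actual container request is the
executor memory plus spark.yarn.executor.memoryOverhead (which defaults
to 10% of executor memory, minimum 384 MiB), so the overhead has to fit
under the node's limit too (my-app.jar is a placeholder):

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors N \
  --executor-memory 385g \
  --driver-memory 15g \
  --conf spark.yarn.executor.memoryOverhead=16384 \
  my-app.jar
```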

Thanks,

Aaron


On Wed, Jun 29, 2016 at 12:09 PM Sean Owen <sowen@cloudera.com> wrote:

> If you have one executor per machine, which is the right default thing
> to do, and this is a singleton in the JVM, then this does just have
> one copy per machine. Of course an executor is tied to an app, so if
> you mean to hold this data across executors that won't help.
>
>
> On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin <aperrin@timerazor.com>
> wrote:
> > The user guide describes a broadcast as a way to move a large dataset to
> > each node:
> >
> > "Broadcast variables allow the programmer to keep a read-only variable
> > cached on each machine rather than shipping a copy of it with tasks. They
> > can be used, for example, to give every node a copy of a large input
> > dataset in an efficient manner."
> >
> > And the broadcast example shows it being used with a variable.
> >
> > But, is it somehow possible to instead broadcast a function that can be
> > executed once, per node?
> >
> > My use case is the following:
> >
> > I have a large data structure that I currently create on each executor.
> > The way that I create it is a hack.  That is, when the RDD function is
> > executed on the executor, I block, load a bunch of data (~250 GiB) from
> > an external data source, create the data structure as a static object in
> > the JVM, and then resume execution.  This works, but it ends up costing
> > me a lot of extra memory (i.e. a few TiB when I have a lot of executors).
> >
> > What I'd like to do is use the broadcast mechanism to load the data
> > structure once, per node.  But, I can't serialize the data structure from
> > the driver.
> >
> > Any ideas?
> >
> > Thanks!
> >
> > Aaron
> >
>
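The per-JVM singleton Sean describes is usually written as a lazily
initialized object referenced from inside a task (e.g. mapPartitions)
rather than broadcast from the driver.  A minimal Scala sketch -- the
names LargeStructure, LargeStructureHolder, and loadFromExternalSource
are placeholders for this use case:

```scala
import org.apache.spark.rdd.RDD

// Placeholder for the ~250 GiB structure described above.
class LargeStructure(/* ... */)

// A Scala `object` is one instance per JVM, i.e. one per executor when
// there is one executor per machine.
object LargeStructureHolder {
  // `lazy val` is initialized on first access, at most once per JVM;
  // concurrent tasks block until initialization completes.
  lazy val data: LargeStructure = loadFromExternalSource()

  private def loadFromExternalSource(): LargeStructure = {
    // ... blocking load from the external data source goes here ...
    new LargeStructure()
  }
}

def process(rdd: RDD[String]): RDD[Int] =
  rdd.mapPartitions { iter =>
    val ds = LargeStructureHolder.data  // built at most once per executor JVM
    iter.map { record =>
      // ... use ds to process each record ...
      record.length
    }
  }
```

Because the load happens inside the executor JVM, nothing has to be
serialized from the driver; the trade-off is that, as Sean notes, the
structure lives only as long as the executor (and hence the app).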
