spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bin Fan <fanbin...@gmail.com>
Subject Re: Possible to broadcast a function?
Date Wed, 29 Jun 2016 22:18:48 GMT
following this suggestion, Aaron, you may take a look at Alluxio as the
off-heap in-memory data storage as input/output for Spark jobs if that
works for you.

See more intro on how to run Spark with Alluxio as data input / output.

http://www.alluxio.org/documentation/en/Running-Spark-on-Alluxio.html

- Bin

On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:

> Have you looked at Alluxio? (earlier tachyon)
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
> Reifier at Strata Hadoop World
> <https://www.youtube.com/watch?v=eD3LkpPQIgM>
> Reifier at Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin <aperrin@timerazor.com>
> wrote:
>
>> The user guide describes a broadcast as a way to move a large dataset to
>> each node:
>>
>> "Broadcast variables allow the programmer to keep a read-only variable
>> cached on each machine rather than shipping a copy of it with tasks. They
>> can be used, for example, to give every node a copy of a large input
>> dataset in an efficient manner."
>>
>> And the broadcast example shows it being used with a variable.
>>
>> But, is it somehow possible to instead broadcast a function that can be
>> executed once, per node?
>>
>> My use case is the following:
>>
>> I have a large data structure that I currently create on each executor.
>> The way that I create it is a hack.  That is, when the RDD function is
>> executed on the executor, I block, load a bunch of data (~250 GiB) from an
>> external data source, create the data structure as a static object in the
>> JVM, and then resume execution.  This works, but it ends up costing me a
>> lot of extra memory (i.e. a few TiB when I have a lot of executors).
>>
>> What I'd like to do is use the broadcast mechanism to load the data
>> structure once, per node.  But, I can't serialize the data structure from
>> the driver.
>>
>> Any ideas?
>>
>> Thanks!
>>
>> Aaron
>>
>>
>

Mime
View raw message