spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason <>
Subject Re: Alternative to Large Broadcast Variables
Date Fri, 28 Aug 2015 19:44:28 GMT
You could try using an external key value store (like HBase, Redis) and
perform lookups/updates inside of your mappers (you'd need to create the
connection within a mapPartitions code block to avoid the connection
setup/teardown overhead)?

I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <> wrote:

> Hi,
> I am working on a Spark application that is using of a large (~3G)
> broadcast variable as a lookup table. The application refines the data in
> this lookup table in an iterative manner. So this large variable is
> broadcast many times during the lifetime of the application process.
> From what I have observed perhaps 60% of the execution time is spent
> waiting for the variable to broadcast in each iteration. My reading of a
> Spark performance article[1] suggests that the time spent broadcasting will
> increase with the number of nodes I add.
> My question for the group - what would you suggest as an alternative to
> broadcasting a large variable like this?
> One approach I have considered is segmenting my RDD and adding a copy of
> the lookup table for each X number of values to process. So, for example,
> if I have a list of 1 million entries to process (eg, RDD[Entry]), I could
> split this into segments of 100K entries, with a copy of the lookup table,
> and make that an RDD[(Lookup, Array[Entry]).
> Another solution I am looking at it is making the lookup table an RDD
> instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
> improve performance. One issue with this approach is that I would have to
> rewrite my application code to use two RDDs so that I do not reference the
> lookup RDD in the from within the closure of another RDD.
> Any other recommendations?
> Jeff
> [1]
> [2]

View raw message