spark-user mailing list archives

From Jörn Franke <>
Subject Re: Streaming - lookup against reference data
Date Wed, 14 Sep 2016 21:09:25 GMT
Hmm, is it just a lookup, and are the values small? In that case I do not think Redis needs
to be installed on each worker node. Redis has a rather efficient protocol, so one or a
few dedicated Redis nodes will probably be more than enough for your purpose. Just try to reuse
connections rather than establishing a new one for each lookup from the same node.
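
A minimal sketch of the connection-reuse idea, in plain Python so it runs without a cluster.
FakeLookupClient and enrich_partition are hypothetical names, and the fake client only stands in
for a real Redis client. In a Spark job a function of this shape would typically be passed to
mapPartitions, so that one connection serves all records of a partition:

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class FakeLookupClient:
    """Hypothetical stand-in for a Redis client; counts how often it is opened."""

    opened = 0

    def __init__(self, data: Dict[str, str]) -> None:
        FakeLookupClient.opened += 1
        self._data = data

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def close(self) -> None:
        pass


def enrich_partition(
    records: Iterable[str],
    connect: Callable[[], FakeLookupClient],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Open one connection for the whole partition and reuse it for every record."""
    client = connect()
    try:
        for key in records:
            yield key, client.get(key)
    finally:
        client.close()


reference = {"a": "alpha", "b": "beta"}
partition = ["a", "b", "a", "zzz"]
result = list(enrich_partition(partition, lambda: FakeLookupClient(reference)))
```

The point of the sketch is only that the connection count grows with the number of partitions,
not with the number of records.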

Additionally, Redis has a lot of interesting data structures, such as HyperLogLogs.

HBase - here you can design where to store which part of the reference data set and partition
in Spark accordingly. This depends on the data and is tricky.

About the other options I am a bit skeptical, especially since you need to include updated
data, which might have side effects.
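
For option 1, the update requirement mostly comes down to bounding how stale the map may get.
A minimal sketch under that assumption, again in plain Python: RefreshingReference and its load
callback are hypothetical, and the injected clock exists only to make the example deterministic.
On the driver, something like this could be consulted once per batch inside foreachRDD:

```python
import time
from typing import Callable, Dict, Optional


class RefreshingReference:
    """Holds a reference map and reloads it once it is older than max_age_seconds."""

    def __init__(
        self,
        load: Callable[[], Dict[str, str]],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._max_age = max_age_seconds
        self._clock = clock
        self._data: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        # Reload on first use, or when the cached copy has exceeded its allowed age.
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._data = self._load()
            self._loaded_at = now
        return self._data.get(key)


# Demo with a fake clock and a loader that returns a new version on each call.
fake_now = [0.0]
versions = iter(["v1", "v2"])
ref = RefreshingReference(
    lambda: {"k": next(versions)}, 30 * 60, clock=lambda: fake_now[0]
)
first = ref.get("k")    # initial load
fake_now[0] = 29 * 60
second = ref.get("k")   # still within 30 minutes, served from the cached map
fake_now[0] = 31 * 60
third = ref.get("k")    # older than 30 minutes, triggers a reload
```

This does nothing about the size limit, and every refresh still has to ship the whole map to
the executors again.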

Nevertheless, you mention all the options that are possible. I guess for a true evaluation
you have to check your use case, the envisioned future architecture for other use cases, required
performance, maintainability etc.

> On 14 Sep 2016, at 20:44, Tom Davis <> wrote:
> Hi all,
> Interested in patterns people use in the wild for lookup against reference data sets
> from a Spark streaming job. The reference dataset will be updated during the life of the job
> (although being 30mins out of date wouldn't be an issue, for example).
> So far I have come up with a few options, all of which have advantages and disadvantages:
> 1. For small reference datasets, distribute the data as an in memory Map() from the driver,
> refreshing it inside the foreachRDD() loop.
> Obviously the limitation here is size.
> 2. Run a Redis (or similar) cache on each worker node, perform lookups against this.
> There's some complexity to managing this, probably outside of the Spark job.
> 3. Load the reference data into an RDD, again inside the foreachRDD() loop on the driver.
> Perform a join of the reference and stream batch RDDs. Perhaps keep the reference RDD in memory.
> I suspect that this will scale, but I also suspect there's going to be the potential
> for a lot of data shuffling across the network which will slow things down.
> 4. Similar to the Redis option, but use Hbase. Scales well and makes data available to
> other services but is a call out over the network, albeit within the cluster.
> I guess there's no solution that fits all, but interested in other people's experience
> and whether I've missed anything obvious.
> Thanks,
> Tom
