spark-user mailing list archives

From Eric Beabes <>
Subject Blacklisting in Spark Stateful Structured Streaming
Date Tue, 10 Nov 2020 22:36:37 GMT
We currently have a “Stateful” Spark Structured Streaming job that computes
aggregates for each ID. I need to implement a new requirement: if the number
of incoming messages for a particular ID exceeds a certain value, add that ID
to a blacklist and remove its state. Going forward, we will not create state
for any blacklisted ID; its messages will simply be filtered out.
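One way to sketch this that avoids an external cache entirely is to keep the blacklist flag inside each ID's own state, so a blacklisted ID retains only a tiny "tombstone" entry instead of a full aggregate. The per-micro-batch transition can be written as a pure function; the names `IdState` and `step` are mine, the plain count stands in for the real aggregate, and the wiring into Spark's `flatMapGroupsWithState` update function is only described in the comments, not shown:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class IdState:
    """Per-ID state kept by the streaming job (hypothetical shape)."""
    count: int = 0
    blacklisted: bool = False

def step(prev: IdState, incoming: int,
         threshold: int) -> Tuple[IdState, Optional[int]]:
    """One micro-batch transition for a single ID.

    Returns the new state plus the aggregate to emit (None when the
    batch is swallowed). In a Spark job this logic would live inside
    the update function passed to flatMapGroupsWithState, with the
    real aggregate in place of the plain count.
    """
    if prev.blacklisted:
        # Already blacklisted: drop the messages, keep only the flag.
        return prev, None
    count = prev.count + incoming
    if count > threshold:
        # Threshold exceeded: discard the aggregate and keep a minimal
        # tombstone so future batches for this ID are filtered cheaply.
        return IdState(0, blacklisted=True), None
    return IdState(count, blacklisted=False), count
```

The trade-off is that a blacklisted ID still occupies a small state entry rather than none at all; if even the tombstones are too many, an external store becomes necessary.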

What’s the best way to implement this in Spark Structured Streaming?
Essentially what we need to do is create a Distributed HashSet that gets
updated intermittently & make this HashSet available to all Executors so
that they can filter out unwanted messages.
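If the blacklist must live outside the state store, one common pattern is not a truly shared HashSet but a per-executor snapshot that is re-fetched on a time-to-live, accepting slightly stale membership between refreshes. A minimal sketch of that cache follows; `fetch` is any callable returning the current blacklist (for example a Redis set read, or a small file on shared storage), and the class name and TTL handling are my assumptions, not an existing Spark facility:

```python
import time

class RefreshingBlacklist:
    """Per-executor cache of a blacklist, refreshed at most once per
    `ttl_seconds`. Intended to be held as a lazily initialized
    singleton on each executor and consulted from a filter.
    """
    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning an iterable of IDs
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._cached = None
        self._loaded_at = 0.0

    def contains(self, id_):
        now = self._clock()
        if self._cached is None or now - self._loaded_at >= self._ttl:
            # Stale or never loaded: pull a fresh snapshot.
            self._cached = set(self._fetch())
            self._loaded_at = now
        return id_ in self._cached
```

The TTL bounds how long a newly blacklisted ID keeps slipping through, which is usually an acceptable trade for not hitting the external store on every record.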

Any pointers would be greatly appreciated. Is the only option to use a
third-party distributed cache such as EhCache, Redis, etc.?
