spark-user mailing list archives

From Christiaan Ras
Subject [Structured streaming] Merging streaming with semi-static datasets
Date Tue, 23 Jan 2018 11:32:30 GMT

I’m currently running some tests with Structured Streaming and I’m wondering how I can merge
the streaming dataset with a more-or-less static dataset (from a JDBC source).
By more-or-less static I mean a dataset that does not change very often and could be cached by
Spark for a while. It is possible to merge a stream with a static dataset, but the static dataset
is refreshed on every batch, which increases batch duration.
With ‘traditional’ (non-structured) Spark Streaming I use a counter and refresh the dataset
(by calling unpersist() and cache()) when the counter hits a certain threshold. I admit it’s not a
state-of-the-art solution, but it works. With Structured Streaming I was not able to get this
mechanism working. It looks like the code between the input and the sinks runs only once…
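
To make the counter-and-threshold idea concrete, here is a minimal plain-Python sketch of it. It is only an illustration under assumptions, not the actual job code: RefreshingCache, load, release and refresh_every are hypothetical names, and no Spark is involved. In a real DStream job, load would presumably read the JDBC table and call cache() on it, release would call unpersist() on the old copy, and get() would be called once per batch.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingCache(Generic[T]):
    """Keeps a loaded dataset and reloads it every `refresh_every` batches.

    Hypothetical helper for illustration; `load` and `release` stand in for
    "read from JDBC and cache()" and "unpersist()" respectively.
    """

    def __init__(
        self,
        load: Callable[[], T],
        refresh_every: int,
        release: Optional[Callable[[T], None]] = None,
    ) -> None:
        if refresh_every < 1:
            raise ValueError("refresh_every must be >= 1")
        self._load = load
        self._release = release
        self._refresh_every = refresh_every
        self._batches_since_load = 0
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        """Call once per batch; reloads when the batch counter hits the threshold."""
        if not self._loaded or self._batches_since_load >= self._refresh_every:
            if self._loaded and self._release is not None:
                # Drop the stale copy before loading a fresh one.
                self._release(self._value)
            self._value = self._load()
            self._loaded = True
            self._batches_since_load = 0
        self._batches_since_load += 1
        return self._value
```

With refresh_every=3, the first three calls to get() return the same loaded object, and the fourth call releases it and loads a new one.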

Is there a way to cache external datasets, use them in consecutive batches (merging with new
incoming streaming data, performing operations and sinking the results), and refresh the external
datasets after a specified number of batches?
