spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <>
Subject Re: Stream which needs to be “joined” with another Stream of “Reference” data.
Date Mon, 03 May 2021 16:51:50 GMT
Can you please clarify:

   1. The IOT messages in one batch have the same device_id or every row
   has different device_id?
   2. The RDBMS table can be read through JDBC in Spark and a dataframe can
   be created on. Does that work for you? You do not really need to stream the
   reference table.


   view my Linkedin profile

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Mon, 3 May 2021 at 17:37, Eric Beabes <> wrote:

> I would like to develop a Spark Structured Streaming job that reads
> messages in a Stream which needs to be “joined” with another Stream of
> “Reference” data.
> For example, let’s say I’m reading messages from Kafka coming in from
> (lots of) IOT devices. This message has a ‘device_id’. We have a DEVICE
> table on a relational database. What I need to do is “join” the ‘device_id’
> in the message with the ‘device_id’ on the table to enrich the incoming
> message. Somewhere I read that, this can be done by joining two streams. I
> guess, we can create a “Stream” that reads the DEVICE table once every hour
> or so.
> Questions:
> 1) Is this the right way to solve this use case?
> 2) Should we use a Stateful Stream for reading DEVICE table with State
> timeout set to an hour?
> 3) What would happen while the DEVICE state is getting updated from the
> table on the relational database?
> Guidance would be greatly appreciated. Thanks.

View raw message