spark-user mailing list archives

From Eric Beabes <mailinglist...@gmail.com>
Subject Re: Stream which needs to be “joined” with another Stream of “Reference” data.
Date Mon, 03 May 2021 17:01:22 GMT
1) The device_id can differ from message to message within a batch.
2) It's a streaming application: the IoT messages are read as a stream in a
Structured Streaming job, so the reference DataFrame would need to be
refreshed every hour. Have you done something similar in the past? Do you
have an example to share?
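
For reference, here is a minimal PySpark sketch of one way the hourly refresh
could look: keep the DEVICE table as a static JDBC DataFrame, reload it when
it is older than an hour, and join it to each micro-batch in foreachBatch.
This is an illustration under assumptions, not something confirmed in this
thread: RefreshingReference and run are hypothetical names, the Kafka value
is assumed to be JSON with a device_id field, the output is assumed to be
Parquet, and a real JDBC source will usually also need user, password and
driver options.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingReference(Generic[T]):
    """Hypothetical helper: holds a snapshot and reloads it after ttl_seconds."""

    def __init__(self, loader: Callable[[], T], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic,
                 on_evict: Optional[Callable[[T], None]] = None) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._on_evict = on_evict
        self._value: Optional[T] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            fresh = self._loader()          # load fully before swapping
            old, self._value = self._value, fresh
            self._loaded_at = now
            if old is not None and self._on_evict is not None:
                self._on_evict(old)         # release the previous snapshot
        return self._value


def run(spark, jdbc_url, kafka_servers, topic, out_path, checkpoint):
    """Assumed wiring; every argument is a placeholder supplied by the caller."""
    from pyspark.sql import functions as F  # imported lazily, needs pyspark

    def load_devices():
        df = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", "DEVICE")
              .load()
              .cache())
        df.count()                          # materialise the cached snapshot now
        return df

    devices = RefreshingReference(load_devices, ttl_seconds=3600.0,
                                  on_evict=lambda old: old.unpersist())

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", kafka_servers)
              .option("subscribe", topic)
              .load()
              .select(F.col("value").cast("string").alias("json"))
              .withColumn("device_id",
                          F.get_json_object("json", "$.device_id")))

    def enrich(batch_df, batch_id):
        # Runs on the driver once per micro-batch; rows in the batch may carry
        # different device_ids, so this is an ordinary left join on the key.
        enriched = batch_df.join(F.broadcast(devices.get()),
                                 on="device_id", how="left")
        enriched.write.mode("append").parquet(out_path)

    return (events.writeStream
            .foreachBatch(enrich)
            .option("checkpointLocation", checkpoint)
            .start())
```

Because the new snapshot is loaded completely before it replaces the old one,
a micro-batch is joined against either the previous DEVICE snapshot or the new
one, never a half-loaded table. The batch that triggers the reload simply
waits for the JDBC read to finish. The broadcast hint assumes the DEVICE table
is small enough to fit on each executor.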

On Mon, May 3, 2021 at 9:52 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Can you please clarify:
>
>
>    1. Do all the IoT messages in one batch have the same device_id, or can
>    every row have a different device_id?
>    2. The RDBMS table can be read through JDBC in Spark and a DataFrame
>    can be created on it. Does that work for you? You do not really need to
>    stream the reference table.
>
>
> HTH
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 3 May 2021 at 17:37, Eric Beabes <mailinglists19@gmail.com> wrote:
>
>> I would like to develop a Spark Structured Streaming job that reads
>> messages in a Stream which needs to be “joined” with another Stream of
>> “Reference” data.
>>
>> For example, let’s say I’m reading messages from Kafka coming in from
>> (lots of) IoT devices. Each message has a ‘device_id’. We have a DEVICE
>> table in a relational database. What I need to do is “join” the ‘device_id’
>> in the message with the ‘device_id’ in the table to enrich the incoming
>> message. I read somewhere that this can be done by joining two streams. I
>> guess we could create a “Stream” that reads the DEVICE table once every hour
>> or so.
>>
>> Questions:
>> 1) Is this the right way to solve this use case?
>> 2) Should we use a stateful stream for reading the DEVICE table, with the
>> state timeout set to an hour?
>> 3) What would happen while the DEVICE state is getting updated from the
>> table on the relational database?
>>
>> Guidance would be greatly appreciated. Thanks.
>>
>
