spark-user mailing list archives

From hahaha sc <>
Subject Re: Structured Streaming & Enrichment Broadcasts
Date Fri, 29 Nov 2019 07:55:42 GMT
I have a scenario similar to yours; we use a UDF to do exactly that. But the
UDF needs to read the value of a broadcast variable, and it's not clear how to
achieve that. Does anyone know?
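One common way to do this (a sketch, not necessarily what either poster is running: it assumes an existing SparkSession named `spark`, a DataFrame `df` with a `key` column, and a hypothetical in-memory lookup map) is to let the UDF's closure capture the `Broadcast` handle and call `.value` inside it:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical enrichment data; in practice this would be loaded from the
// external source.
val lookup: Map[String, String] = Map("a" -> "1", "b" -> "2")

// Broadcast once; executors fetch the value lazily on first access.
val bcast = spark.sparkContext.broadcast(lookup)

// The UDF captures the Broadcast handle (small), not the raw map, so only
// the handle is serialized with the closure; .value resolves on executors.
val enrich = udf((key: String) => bcast.value.getOrElse(key, "unknown"))

val enriched = df.withColumn("enriched", enrich($"key"))
```

The key point is to capture the `Broadcast[...]` handle rather than calling `.value` at definition time, which would serialize the whole map into the closure.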

Burak Yavuz <> wrote on Tue, Nov 19, 2019 at 12:23 PM:

> If you store the data that you're going to broadcast as a Delta table and
> perform a stream-batch join (where your Delta table is the batch side), it
> will auto-update once the table receives any updates.
> Best,
> Burak
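A minimal sketch of the stream-batch join Burak describes (assumptions: Delta Lake is on the classpath, an enrichment Delta table exists at the hypothetical path `/tables/enrichment`, a Kafka source with hypothetical server/topic names, and a SparkSession named `spark`). Reading the Delta table with `spark.read` makes it the static/batch side of the join, and each micro-batch resolves it against the table's latest committed version:

```scala
import spark.implicits._

// Batch side: a Delta table. Re-resolved per micro-batch, so updates to the
// table are picked up without restarting the query.
val enrichment = spark.read.format("delta").load("/tables/enrichment")

// Stream side: Kafka, reduced to a join key (names are placeholders).
val stream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS key")

// Stream-batch left outer join on the lookup key.
val joined = stream.join(enrichment, Seq("key"), "left_outer")

val query = joined.writeStream.format("console").start()
```

If the enrichment table is small enough, a broadcast hint on the batch side (`broadcast(enrichment)`) can keep the join a map-side lookup.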
> On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey <>
> wrote:
>> Hello.
>> We're running applications using Spark Streaming and are beginning work to
>> move to Structured Streaming.  One of our key scenarios is to
>> lookup values from an external data source for each record in an incoming
>> stream.  In Spark Streaming we currently read the external data, broadcast
>> it and then lookup the value from the broadcast.  The broadcast value is
>> refreshed on a periodic basis - with the need to refresh evaluated on each
>> batch (in a foreachRDD).  The broadcasts are somewhat large (~1M records).
>> Each stream we're doing the lookup(s) for is ~6M records / second.
>> While we could conceivably continue this pattern in Structured Streaming
>> with Spark 2.4.x and the 'foreachBatch', based on my read of documentation
>> this seems like a bit of an anti-pattern in Structured Streaming.
>> So I am looking for advice: what mechanism would you suggest for
>> periodically reading an external data source and doing a fast lookup
>> against a streaming input?  One option appears to be a broadcast left outer
>> join, but in the past that mechanism has been harder to performance-tune
>> than an explicit broadcast and lookup.
>> Regards,
>> Bryan Jeffrey
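For reference, the `foreachBatch` variant of the pattern Bryan describes could look roughly like this (a sketch only: it assumes a SparkSession named `spark`, a streaming DataFrame `stream` with a `key` column, a hypothetical `loadExternalData()` helper, and a made-up 15-minute refresh interval). The broadcast is rebuilt only when stale, mirroring the per-batch refresh check done today in `foreachRDD`:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._

@volatile private var bcast: Broadcast[Map[String, String]] = _
@volatile private var lastRefresh = 0L
val refreshIntervalMs = 15 * 60 * 1000L // placeholder interval

// Rebuild the broadcast only when the refresh interval has elapsed.
def currentBroadcast(): Broadcast[Map[String, String]] = synchronized {
  val now = System.currentTimeMillis()
  if (bcast == null || now - lastRefresh > refreshIntervalMs) {
    if (bcast != null) bcast.unpersist()
    bcast = spark.sparkContext.broadcast(loadExternalData()) // hypothetical loader
    lastRefresh = now
  }
  bcast
}

val query = stream.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  val b = currentBroadcast()
  val enrich = udf((k: String) => b.value.getOrElse(k, "unknown"))
  batch.withColumn("enriched", enrich($"key")).show() // placeholder sink
}.start()
```

Whether this is preferable to a stream-batch join likely depends on refresh cadence and lookup volume; at ~6M records/second the explicit broadcast keeps the lookup a local map access rather than a join.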
