spark-user mailing list archives

From Yogesh Mahajan <>
Subject Re: [Structured Streaming] Avoiding multiple streaming queries
Date Tue, 13 Feb 2018 18:45:46 GMT
I had a similar issue, and I think this is where the Structured Streaming
design falls short.
It seems that Question #2 in your email is a viable workaround for you.
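
A rough sketch of the cascade from Question #2 (quoted below), covering only
the downstream fan-out queries. It assumes the first query has already
written its aggregated rows as JSON strings to an intermediate topic. The
broker address, topic names, key columns, and checkpoint paths are all
hypothetical placeholders, and `spark` is an existing SparkSession.

```python
def rekey_exprs(key_column):
    """Build selectExpr expressions yielding the key/value columns the Kafka sink expects.

    Assumes each message value is a JSON object containing `key_column`.
    """
    return [
        f"get_json_object(CAST(value AS STRING), '$.{key_column}') AS key",
        "CAST(value AS STRING) AS value",
    ]


def start_fanout(spark, routes, bootstrap="localhost:9092",
                 intermediate_topic="aggregated"):
    """Start one lightweight query per output topic; none of them re-aggregates.

    `routes` maps an output topic name to the JSON field used as its key
    (both hypothetical).
    """
    queries = []
    for out_topic, key_column in routes.items():
        # Read the already-aggregated rows written by the first query.
        aggregated = (spark.readStream
                      .format("kafka")
                      .option("kafka.bootstrap.servers", bootstrap)
                      .option("subscribe", intermediate_topic)
                      .load())
        # Re-key the rows and write them to this route's topic.
        query = (aggregated
                 .selectExpr(*rekey_exprs(key_column))
                 .writeStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", bootstrap)
                 .option("topic", out_topic)
                 .option("checkpointLocation", f"/tmp/checkpoints/fanout_{out_topic}")
                 .start())
        queries.append(query)
    return queries
```

Something like `start_fanout(spark, {"by_user": "user_id", "by_region": "region"})`
would start two such queries. Each one still reads the intermediate topic
separately, so this avoids repeating the aggregation but not the Kafka reads.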

In my case, I have a custom Sink backed by an efficient in-memory column
store suited for fast ingestion.

I have a Kafka stream coming from one topic, and I need to classify the
stream based on schema.
For example, a single Kafka topic can carry messages with three different
schemas, and I would like to ingest them into three different column
tables (each with a different schema) using my custom Sink implementation.

Right now the only(?) option I have is to create three streaming queries
that read the same topic and ingest into their respective column tables
through their own Sink implementations.
These three streaming queries create three underlying IncrementalExecutions
and three KafkaSources, so the same data is read from the same Kafka topic
three times.
Even with CachedKafkaConsumers at the partition level, this is not an
efficient way to handle a simple streaming use case.
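
To make the setup concrete, here is a minimal sketch of the three-query
approach. It is an illustration under stated assumptions, not my actual code:
it assumes the messages are JSON with a `schema_type` field that identifies
the schema, and it uses the built-in `console` sink as a stand-in for the
custom column-store Sink. The broker address, topic name, type names, and
checkpoint paths are hypothetical, and `spark` is an existing SparkSession.

```python
# Hypothetical schema identifiers; the real topic carries three message schemas.
SCHEMA_TYPES = ["typeA", "typeB", "typeC"]


def schema_filter(schema_type):
    """SQL predicate selecting the messages of one schema type.

    Assumes a JSON payload with a `schema_type` field (an assumption made
    for this sketch only).
    """
    return (
        "get_json_object(CAST(value AS STRING), '$.schema_type') "
        f"= '{schema_type}'"
    )


def start_queries(spark, bootstrap="localhost:9092", topic="events"):
    """Start one streaming query per schema type.

    Every query builds its own Kafka source, so the topic is read three times.
    """
    queries = []
    for schema_type in SCHEMA_TYPES:
        source = (spark.readStream
                  .format("kafka")
                  .option("kafka.bootstrap.servers", bootstrap)
                  .option("subscribe", topic)
                  .load())
        query = (source
                 .where(schema_filter(schema_type))
                 .writeStream
                 .format("console")  # stand-in for the custom Sink
                 .queryName(f"ingest_{schema_type}")
                 .option("checkpointLocation", f"/tmp/checkpoints/{schema_type}")
                 .start())
        queries.append(query)
    return queries
```

Each `start()` call launches an independent query with its own source and
offsets, which is where the duplicated reads come from.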

One workaround for this limitation is to have the same schema for all
the messages in a Kafka partition. Unfortunately, this is not in our control,
and customers cannot change it due to their dependencies on other

Thanks, <>

On Mon, Feb 12, 2018 at 5:54 PM, Priyank Shrivastava wrote:

> I have a structured streaming query which sinks to Kafka.  This query has
> a complex aggregation logic.
> I would like to sink the output DF of this query to multiple Kafka topics
> each partitioned on a different ‘key’ column.  I don’t want to have
> multiple Kafka sinks for each of the different Kafka topics because that
> would mean running multiple streaming queries - one for each Kafka topic,
> especially since my aggregation logic is complex.
> Questions:
> 1.  Is there a way to output the results of a structured streaming query
> to multiple Kafka topics each with a different key column but without
> having to execute multiple streaming queries?
> 2.  If not,  would it be efficient to cascade the multiple queries such
> that the first query does the complex aggregation and writes output
> to Kafka and then the other queries just read the output of the first query
> and write their topics to Kafka thus avoiding doing the complex aggregation
> again?
> Thanks in advance for any help.
> Priyank
