spark-user mailing list archives

From dcam <dcame...@digitalocean.com>
Subject Re: [Structured Streaming] Avoiding multiple streaming queries
Date Tue, 13 Feb 2018 14:55:40 GMT
Hi Priyank

I have a similar structure, although I am reading from Kafka and sinking to
multiple MySQL tables. My input stream has multiple message types and each
is headed for a different MySQL table.

I've looked for a solution for a few months, and have only come up with two
alternatives:

1. Since I'm already using a ForeachSink (there is no native MySQL sink), I
could write each batch to the different tables from within that one sink.
But having a single Spark job do all the sinking seems like it would be
confusing, and the sink itself would be fairly complex.

2. The same as your second option: have one job sort through the stream and
persist the sorted streams to HDFS. Then read the sorted streams in
individual jobs and sink them into the appropriate tables.
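
A minimal sketch of the routing logic behind alternative 1, written in plain
Python rather than as a real Spark sink. It assumes each message is a dict
with a "type" field, and it uses an in-memory SQLite database as a stand-in
for MySQL. The ROUTES mapping, the table names, and sink_batch are all
hypothetical names made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical mapping: message type -> (target table, columns to insert).
ROUTES = {
    "order": ("orders", ("id", "amount")),
    "click": ("clicks", ("id", "url")),
}


def sink_batch(conn, batch):
    """Write one batch of mixed-type messages to their tables in one transaction."""
    grouped = defaultdict(list)
    for msg in batch:
        grouped[msg["type"]].append(msg)

    # Reject the whole batch up front if any type has no target table.
    unknown = set(grouped) - set(ROUTES)
    if unknown:
        raise ValueError(f"no table for message types: {sorted(unknown)}")

    with conn:  # commits on success, rolls back if an insert raises
        for msg_type, msgs in grouped.items():
            table, cols = ROUTES[msg_type]
            placeholders = ", ".join("?" for _ in cols)
            sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
            conn.executemany(sql, [tuple(m[c] for c in cols) for m in msgs])


# SQLite stands in for MySQL here; the schema is invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE clicks (id INTEGER, url TEXT)")

sink_batch(conn, [
    {"type": "order", "id": 1, "amount": 9.5},
    {"type": "click", "id": 2, "url": "/home"},
    {"type": "order", "id": 3, "amount": 20.0},
])
```

Even in this toy form, the one sink has to know every message type and every
table, which is the complexity concern raised in alternative 1.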

I haven't implemented it yet, but it seems to me that the code for option 2
will be simpler, and operationally things will be clearer. If a job fails, I
will have a better understanding of what state it is in.
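
The two-stage shape of option 2 can be sketched in plain Python as follows.
This is only an illustration under stated assumptions: a local temporary
directory stands in for HDFS, JSON-lines files stand in for whatever format
the Spark job would persist, and split_by_type and load_type are hypothetical
names. In Spark, stage 1 would presumably be one streaming query writing
files laid out by message type, and stage 2 one job per table.

```python
import json
import tempfile
from pathlib import Path


def split_by_type(messages, staging_dir):
    """Stage 1: append each message to <staging_dir>/<type>/part.jsonl."""
    counts = {}
    for msg in messages:
        type_dir = Path(staging_dir) / msg["type"]
        type_dir.mkdir(parents=True, exist_ok=True)
        with open(type_dir / "part.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(msg) + "\n")
        counts[msg["type"]] = counts.get(msg["type"], 0) + 1
    return counts


def load_type(staging_dir, msg_type):
    """Stage 2: read back only the staged records for one message type."""
    path = Path(staging_dir) / msg_type / "part.jsonl"
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# A temporary local directory stands in for the HDFS staging area.
staging = tempfile.mkdtemp()
counts = split_by_type([
    {"type": "order", "id": 1, "amount": 9.5},
    {"type": "click", "id": 2, "url": "/home"},
    {"type": "order", "id": 3, "amount": 20.0},
], staging)

# Each downstream loader only ever sees its own type.
orders = load_type(staging, "order")
```

Because the staged files persist between the two stages, a failed loader can
be inspected or rerun against its own directory without touching the
splitter or the other loaders.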

Reading Manning's Big Data book by Nathan Marz and James Warren has been
influencing how I structure Spark jobs recently. They don't shy away from
persisting intermediate data sets, and I am embracing that in my thinking
right now.

Cheers!
Dave


