I am experimenting with the stream-stream join feature in Spark 2.3.0 to see whether I can use it to replace some of our existing services.
Suppose I have 3 worker nodes, each with 16 GB of RAM and a 100 GB SSD. My input dataset, which is in Kafka, is about 250 GB per day. I want to do a stream-stream join across 8 DataFrames with the watermark set to 24 hours (a 24-hour tumbling window), so I need to hold state for 24 hours, after which all of that data can be cleared.
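To make the sizing concern concrete, here is a rough back-of-the-envelope calculation in Python using the numbers above. It assumes the buffered join state is about the size of one day's raw input, which is only an approximation: it ignores serialization and JVM overhead, and any extra copies the state store keeps.

```python
# Rough sizing for the cluster described above (all figures in GB).
NODES = 3
RAM_PER_NODE_GB = 16
SSD_PER_NODE_GB = 100
DAILY_INPUT_GB = 250  # Kafka input per day

# Assumption: state held for the 24-hour watermark is roughly one day's raw input.
state_gb = DAILY_INPUT_GB

total_ram_gb = NODES * RAM_PER_NODE_GB
total_ssd_gb = NODES * SSD_PER_NODE_GB

print(f"total RAM: {total_ram_gb} GB")
print(f"total SSD: {total_ssd_gb} GB")
print(f"state / RAM: {state_gb / total_ram_gb:.1f}x")
print(f"state fits in RAM: {state_gb <= total_ram_gb}")
print(f"state fits on SSD: {state_gb <= total_ssd_gb}")
```

Under that assumption the state is several times larger than the combined RAM of the cluster but would still fit on the combined SSDs, which is why I am asking the questions below.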
1) What happens if the data does not fit into memory during the stream-stream join?
2) Which storage level should I choose here for near-optimal performance?
3) Any other suggestions?