spark-user mailing list archives

From "Mendelson, Assaf" <Assaf.Mendel...@rsa.com>
Subject RE: few basic questions on structured streaming
Date Thu, 08 Dec 2016 12:27:24 GMT
For watermarking, you can read this excellent two-part article: part 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101,
part 2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. It covers more
than just watermarking, but it helped me understand many of the concepts in Structured Streaming.
In any case, watermarking is not implemented yet. I believe it is targeted at Spark
2.1, which is supposed to come out soon.
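
As a rough illustration of the idea (plain Python, not Spark code), the sketch below uses one common heuristic: the watermark trails the latest event time seen by a fixed allowed lateness, windows that end at or before the watermark are closed, and events for closed windows are dropped. The class name `WatermarkedWindowCounter`, its fields, and the tumbling-window alignment are all assumptions made for this example.

```python
from datetime import datetime, timedelta

class WatermarkedWindowCounter:
    """Hypothetical helper: counts events per tumbling event-time window."""

    def __init__(self, window, allowed_lateness):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None  # latest event time seen so far
        self.open_counts = {}       # window start -> count, still updatable
        self.final_counts = {}      # window start -> count, closed by the watermark
        self.dropped = 0            # events that arrived too late to be counted

    def watermark(self):
        # Assumed heuristic: latest event time minus the allowed lateness.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time):
        # Align the event to the start of its tumbling window.
        start = event_time - (event_time - datetime.min) % self.window
        wm = self.watermark()
        if wm is not None and start + self.window <= wm:
            # The window is already closed, so the late event is dropped.
            self.dropped += 1
            return False
        self.open_counts[start] = self.open_counts.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Close every window that now ends at or before the watermark,
        # so its state no longer has to be kept updatable.
        wm = self.watermark()
        for s in [s for s in self.open_counts if s + self.window <= wm]:
            self.final_counts[s] = self.open_counts.pop(s)
        return True
```

With a 10-minute window and one hour of allowed lateness, an event that is 30 minutes late is still counted, while an event for a window that closed earlier is rejected. The point is that only the open windows need to stay in memory.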
Assaf.

From: kant kodali [mailto:kanth909@gmail.com]
Sent: Thursday, December 08, 2016 1:50 PM
To: user @spark
Subject: few basic questions on structured streaming

Hi All,

I read the documentation on Structured Streaming based on event time and I have the following
questions.

1. What happens if an event arrives a few days late? It looks like we have an unbounded table with
sorted time intervals as keys, but I assume Spark doesn't keep several days' worth of data in
memory. Rather, it would checkpoint parts of the unbounded table to storage at a specified
interval, so that if an event comes a few days late it would update the part of the table that
is in memory plus the parts of the table in storage that contain the interval. (Again,
this is just my assumption; I don't know what it really does.) Is this correct so far?

2. Say I am running a Spark Structured Streaming job for 90 days with a window interval of
10 mins and a slide interval of 5 mins. Does the output of this job always return the entire
history in a table? In other words, does the output on the 90th day contain a table of 10-minute
time intervals from day 1 to day 90? If so, wouldn't that be too big to return as an output?

3. For Structured Streaming, is it required to have distributed storage such as HDFS? My
guess would be yes (based on what I said in #1), but I would like to confirm.

4. I briefly heard about watermarking. Are there any pointers where I can learn about it in more
detail? Specifically, how watermarks could help in Structured Streaming and so on.
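
To make the sizes in question 2 concrete, here is a small plain-Python sketch (not Spark code) of how an event time maps onto sliding windows with a 10-minute window and a 5-minute slide, and roughly how many window starts accumulate over 90 days. The function names `windows_for` and `window_count` are made up for illustration, and aligning window starts to multiples of the slide since the Unix epoch is an assumption.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)
SLIDE = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def windows_for(event_time, window=WINDOW, slide=SLIDE):
    """Return the [start, end) windows that contain event_time.

    Assumes window starts are aligned to multiples of `slide` since the epoch.
    """
    # Start of the latest window that begins at or before the event.
    start = event_time - (event_time - EPOCH) % slide
    result = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + window > event_time:
        result.append((start, start + window))
        start -= slide
    return sorted(result)

def window_count(duration, slide=SLIDE):
    """Approximate number of distinct window starts in a run of this length."""
    return duration // slide

if __name__ == "__main__":
    t = datetime(2016, 12, 8, 12, 7, tzinfo=timezone.utc)
    for start, end in windows_for(t):
        print(start.time(), "to", end.time())
    print(window_count(timedelta(days=90)), "window starts in 90 days")
```

An event at 12:07 falls into two windows (12:00 to 12:10 and 12:05 to 12:15), and 90 days at a 5-minute slide gives about 25,920 window starts per grouping key.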

Thanks,
kant
