spark-user mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: Lazy Spark Structured Streaming
Date Sun, 02 Aug 2020 12:01:13 GMT
SPARK-24156 runs the no-data batch to apply the updated watermark, but the
updated watermark may still not be eligible to evict all state rows
(depending on, e.g., the window and the lateness of the watermark).
You'll still need to provide a dummy input record to advance the watermark, so
that all expected state rows can be evicted.
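To make the eviction rule above concrete, here is a minimal sketch (plain Java, not Spark's actual implementation; the class, method names, and numbers are hypothetical). It models the two facts from this thread: the watermark is the max event time seen so far minus the allowed lateness, and in Append mode a window's state row is evicted, and its result emitted, only once the watermark passes the window's end.

```java
// Hypothetical sketch of the watermark/eviction rule discussed above.
// This is NOT Spark's code; it only models the arithmetic.
public class WatermarkSketch {

    // Watermark = max event time observed so far minus the allowed lateness.
    static long watermark(long maxEventTime, long lateness) {
        return maxEventTime - lateness;
    }

    // A window's state row can be evicted (and its aggregate emitted in
    // Append mode) only when the watermark has reached the window's end.
    static boolean evictable(long windowEnd, long wm) {
        return windowEnd <= wm;
    }

    public static void main(String[] args) {
        long lateness = 10L;   // e.g. withWatermark("ts", "10 seconds")
        long windowEnd = 60L;  // window [0, 60)

        // Events only up to t = 65: watermark = 55 < 60,
        // so the window is NOT emitted even after a no-data batch.
        System.out.println(evictable(windowEnd, watermark(65L, lateness)));

        // A later "dummy" record at t = 70 pushes the watermark to 60,
        // which is enough to evict the window and emit its result.
        System.out.println(evictable(windowEnd, watermark(70L, lateness)));
    }
}
```

This is why the no-data batch alone isn't always enough: it re-applies the watermark, but if no input has moved the max event time far enough, the window end is still ahead of the watermark and the state row stays put.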

On Sun, Aug 2, 2020 at 5:44 PM Phillip Henry <londonjavaman@gmail.com>
wrote:

> Thanks, Jungtaek. Very useful information.
>
> Could I please trouble you with one further question - what you said makes
> perfect sense but to what exactly does SPARK-24156
> <https://issues.apache.org/jira/browse/SPARK-24156> refer if not fixing
> the "need to add a dummy record to move watermark forward"?
>
> Kind regards,
>
> Phillip
>
>
>
>
> On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim <
> kabhwan.opensource@gmail.com> wrote:
>
>> I'm not sure what exactly your problem is, but given you've mentioned
>> window and OutputMode.Append, you may want to keep in mind that append mode
>> doesn't produce the output of an aggregation until the watermark "passes by"
>> the end of the window. It's expected behavior if you're seeing lazy outputs
>> on OutputMode.Append compared to OutputMode.Update.
>>
>> Unfortunately there's no mechanism in SSS to move the watermark forward
>> without actual input, so if you want to test some behavior on
>> OutputMode.Append you would need to add a dummy record to move the
>> watermark forward.
>>
>> Hope this helps.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry <londonjavaman@gmail.com>
>> wrote:
>>
>>> Sorry, I should have mentioned that Spark only seems reluctant to take the
>>> last windowed, groupBy batch from Kafka when using OutputMode.Append.
>>>
>>> I've asked on StackOverflow:
>>>
>>> https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka
>>> but am still struggling. Can anybody please help?
>>>
>>> How do people test their SSS code if you have to put an extra message on
>>> Kafka just to get Spark to consume the previous batch?
>>>
>>> Kind regards,
>>>
>>> Phillip
>>>
>>>
>>> On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <londonjavaman@gmail.com>
>>> wrote:
>>>
>>>> Hi, folks.
>>>>
>>>> I noticed that SSS won't process a waiting batch if there are no
>>>> batches after that. To put it another way, Spark must always leave one
>>>> batch on Kafka waiting to be consumed.
>>>>
>>>> There is a JIRA for this at:
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-24156
>>>>
>>>> that says it's resolved in 2.4.0 but my code
>>>> <https://github.com/PhillHenry/SSSPlayground/blob/Spark2/src/test/scala/uk/co/odinconsultants/sssplayground/windows/TimestampedStreamingSpec.scala>
>>>> is using 2.4.2 yet I still see Spark reluctant to consume another batch
>>>> from Kafka if it means there is nothing else waiting to be processed in the
>>>> topic.
>>>>
>>>> Do I have to do something special to exploit the behaviour that
>>>> SPARK-24156 says it has addressed?
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>>
