spark-user mailing list archives

From Jungtaek Lim <kabh...@gmail.com>
Subject Re: [Structured Streaming] Metrics or logs of events that are ignored due to watermark
Date Tue, 03 Jul 2018 05:42:56 GMT
Hi,

I tried this in https://github.com/apache/spark/pull/21617, but soon
realized that it does not give an accurate count of late input rows, because
Spark applies the watermark lazily and discards rows at the stateful
operator(s), whose inputs are not necessarily the same as the original input
rows (some have already been filtered out, and multiple rows may have been
aggregated into one).

To get an accurate count of late input rows (or the rows themselves), we would
need to filter out late input rows in the first phase of the query. That would
be less flexible (in most cases a derived field could no longer serve as the
watermark field), but the majority of streaming frameworks adopt this policy
and expose late input rows on that basis.
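
To illustrate what filtering in the first phase would mean, here is a minimal
sketch in plain Python. It is not Spark code and uses no Spark API; the row
shape, the function name split_late_rows and the rule that the watermark only
advances between batches are assumptions made for the illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical row shape for this sketch: (key, event_time_in_seconds).
Row = Tuple[str, int]


def split_late_rows(
    batches: Iterable[List[Row]], delay_seconds: int
) -> Tuple[List[Row], List[Row]]:
    """Split raw input rows into on-time and late rows, batch by batch.

    The check runs on the original input rows, before any filtering or
    aggregation, so the number of late rows is exact.
    """
    watermark = 0        # assumes non-negative event times; nothing is late at first
    max_event_time = 0
    on_time: List[Row] = []
    late: List[Row] = []
    for batch in batches:
        for row in batch:
            # Compare against the watermark fixed at the start of this batch.
            if row[1] < watermark:
                late.append(row)
            else:
                on_time.append(row)
            max_event_time = max(max_event_time, row[1])
        # Advance the watermark only between batches (assumption of this sketch).
        watermark = max(watermark, max_event_time - delay_seconds)
    return on_time, late


batches = [
    [("a", 100), ("b", 105)],            # watermark becomes 95 after this batch
    [("a", 90), ("a", 91), ("b", 110)],  # the two "a" rows are behind 95
]
on_time, late = split_late_rows(batches, delay_seconds=10)
print(len(on_time), len(late))
```

Here both late "a" rows are counted individually. If they had first been
aggregated into a single row for key "a", a counter placed at the stateful
operator would see one dropped row rather than two, which is the inaccuracy
described above.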

So I think this is worth addressing, and I plan to try it myself, but it is
fine if someone else gets to it first.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Jul 3, 2018 at 3:39 AM, subramgr <subramanian.girish@gmail.com> wrote:

> Hi all,
>
> Are there any logs, or any metrics recorded in log files or a metrics sink,
> for the number of events that are ignored due to the watermark in
> Structured Streaming?
>
> Thanks