spark-user mailing list archives

From GOEL Rajat <>
Subject Re: Structured Streaming metric for count of delayed/late data
Date Mon, 24 Aug 2020 05:14:21 GMT
Thanks for the pointers. I will try these changes.

From: Jungtaek Lim <>
Date: Saturday, 22 August 2020 at 2:41 PM
To: GOEL Rajat <>
Cc: "" <>
Subject: Re: Structured Streaming metric for count of delayed/late data

I proposed another approach which provides an accurate count, though the number doesn't always
mean the rows were dropped. ( for details)

Btw, the limitation only applies to streaming aggregation, so you can implement the aggregation
yourself via (flat)MapGroupsWithState - note that the local aggregation is an "optimization",
so you may need to account for the performance impact.
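The (flat)MapGroupsWithState approach above means you drop late rows yourself and can count exactly what you drop. Here is a minimal pure-Python sketch of that semantics (illustrative only, not Spark code; the function and variable names are made up for this example):

```python
# Sketch of the logic one would implement inside (flat)MapGroupsWithState:
# hold per-key state, filter late events against the watermark manually,
# and keep an explicit counter of the rows that are dropped.
# This is NOT a Spark API; it only illustrates the semantics.

def process_events(events, watermark):
    """events: iterable of (key, event_time, value).
    Returns (per-key aggregates, count of dropped late rows)."""
    state = {}     # per-key (sum, count), as GroupState would hold in Spark
    dropped = 0
    for key, event_time, value in events:
        if event_time < watermark:
            dropped += 1          # late row, counted explicitly before dropping
            continue
        total, count = state.get(key, (0, 0))
        state[key] = (total + value, count + 1)
    return state, dropped

aggs, dropped = process_events(
    [("a", 10, 1), ("a", 3, 5), ("b", 12, 2)], watermark=5)
# one row ("a", 3, 5) arrives before the watermark and is counted as dropped
```

Because you filter before aggregating, the count reflects real input rows, unlike the built-in metric on streaming aggregation; the trade-off is that you lose Spark's map-side partial aggregation, hence the performance caveat above.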

On Sat, Aug 22, 2020 at 1:29 PM GOEL Rajat <<>> wrote:
Thanks for pointing me to the Spark ticket and its limitations. Will try these changes.
Is there any workaround for this limitation of the inaccurate count, maybe by adding some additional
streaming operation to the SS job without impacting performance too much?


From: Jungtaek Lim <<>>
Date: Friday, 21 August 2020 at 12:07 PM
To: Yuanjian Li <<>>
Cc: GOEL Rajat <<>>,
"<>" <<>>
Subject: Re: Structured Streaming metric for count of delayed/late data

One more thing: unfortunately, the number is not accurate relative to the input rows
in streaming aggregation, because Spark does a local aggregate first and counts dropped inputs based
on the "pre-locally-aggregated" rows. You may want to treat the number as an indicator of whether
input dropping is happening at all, rather than as an exact count.
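A small worked example of why the count undercounts (a pure-Python sketch of the idea, not Spark internals): map-side partial aggregation can collapse several late input rows into a single partially-aggregated row, and it is that single row which gets counted when it is dropped.

```python
# Illustration of the undercount: three late input rows for the same
# (key, window) collapse into one partial-aggregate row before the
# stateful operator sees them, so only one "drop" is counted.

def local_preaggregate(rows):
    """Collapse rows by (key, window), like a map-side partial aggregate."""
    partial = {}
    for key, window, value in rows:
        partial[(key, window)] = partial.get((key, window), 0) + value
    return list(partial.items())

late_inputs = [("a", 0, 1), ("a", 0, 1), ("a", 0, 1)]  # three late input rows
partial_rows = local_preaggregate(late_inputs)         # collapses to one row

dropped_as_counted = len(partial_rows)  # what the metric would see: 1
dropped_actual = len(late_inputs)       # true number of late input rows: 3
```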

On Fri, Aug 21, 2020 at 3:31 PM Yuanjian Li <<>> wrote:
The metrics have been added in, but the
target version is 3.1.
Maybe you can backport the change for testing, since it's not a big one.
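Once the metric is available (on 3.1, or via a backport), it surfaces per batch in the streaming query progress. The sketch below pulls a dropped-rows counter out of a progress JSON payload, e.g. from `query.lastProgress` or a `StreamingQueryListener`; the sample payload and the exact field values here are fabricated for illustration, and the `numRowsDroppedByWatermark` field name is an assumption about the 3.1 metric:

```python
import json

# Sample StreamingQueryProgress-style payload (fabricated for this example).
sample_progress = json.dumps({
    "batchId": 42,
    "stateOperators": [
        {"numRowsTotal": 100, "numRowsDroppedByWatermark": 7},
    ],
})

def dropped_by_watermark(progress_json):
    """Sum the dropped-rows counter across all stateful operators in one batch."""
    progress = json.loads(progress_json)
    return sum(op.get("numRowsDroppedByWatermark", 0)
               for op in progress.get("stateOperators", []))
```

In a real job you would call this per batch, e.g. inside `StreamingQueryListener.onQueryProgress`, and feed the value to your monitoring system.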


On Thu, Aug 20, 2020, GOEL Rajat <<>> wrote:
Hi All,

I have a query; can someone please help? Is there any metric or mechanism for printing the count
of input records dropped due to watermarking (the late-data count) in a stream, during a
window-based aggregation, in Structured Streaming? I am using Spark 3.0.

Thanks & Regards,