spark-user mailing list archives

From Tathagata Das <>
Subject Re: streaming window not behaving as advertised (v1.0.1)
Date Wed, 06 Aug 2014 02:37:42 GMT
1. updateStateByKey should be called on all keys even if there is no data
corresponding to that key. There is a unit test for that.
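
To illustrate the intended behavior, here is a minimal sketch in plain Python
(no Spark required). The update function follows the PySpark-style signature
(new values for the key, previous state or None); run_batch, BATCH_SECONDS
and the state layout are hypothetical stand-ins for illustration, not Spark
APIs, and the loop only models "every known key is updated each batch".

```python
from collections import defaultdict

# Assumed batch interval, matching the 1 second batch in the quoted message.
BATCH_SECONDS = 1

def update_duration(new_values, last_state):
    """Update function: new_values may be empty, last_state may be None."""
    state = last_state or {"events": 0, "age_seconds": 0}
    return {
        "events": state["events"] + len(new_values),
        # Advances even when no data arrived for the key in this batch.
        "age_seconds": state["age_seconds"] + BATCH_SECONDS,
    }

def run_batch(states, batch_events):
    """Hypothetical stand-in for one batch: updates every known key."""
    grouped = defaultdict(list)
    for key, value in batch_events:
        grouped[key].append(value)
    keys = set(states) | set(grouped)
    return {k: update_duration(grouped.get(k, []), states.get(k)) for k in keys}

states = run_batch({}, [("a", 1), ("b", 1)])
states = run_batch(states, [("a", 1)])  # no data for "b" in this batch
states = run_batch(states, [])          # batch with no events at all
```

Under these assumptions, "b" keeps aging across the last two batches even
though it received no events, which is the duration-update behavior asked
about in the quoted message.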

2. I am increasing the priority for this. Off the top of my head, this is
easy to fix, but hard to test reliably in a unit test. Will fix it
soon after the Spark 1.1 release.


On Fri, Aug 1, 2014 at 7:37 AM, RodrigoB <> wrote:

> Hi TD,
> I've also been fighting this issue only to find the exact same solution you
> are suggesting.
> Too bad I didn't find either the post or the issue sooner.
> I'm using a 1 second batch with N Kafka events (1 to 1 with the
> state objects) per batch and only calling the updateStateByKey function.
> This is my interpretation, please correct me if needed:
> Because of Spark’s lazy computation, the RDDs weren’t being updated as
> expected on each batch interval. The assumption was that as long as
> I have a streaming batch run (with or without new messages), I should get
> updated RDDs, which was not happening. We only get updateStateByKey calls
> for objects which got events or that are forced through an output function
> to compute. I did not run further tests to confirm this, but that's the
> impression I got.
> This doesn't fit our requirements, as we want to do duration updates based
> on the batch interval, so I had to force the computation of all the
> objects through the foreachRDD function.
> I would also appreciate it if the priority of the issue could be increased. I
> assume the foreachRDD call is an additional, unnecessary resource allocation
> (although I'm not sure how much), as opposed to doing it somehow by default
> on batch interval execution.
> Thanks,
> Rod