spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: streaming window not behaving as advertised (v1.0.1)
Date Wed, 06 Aug 2014 02:37:42 GMT
1. updateStateByKey should be called on all keys even if there is no data
corresponding to that key. There is a unit test for that.
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala#L337
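
To illustrate what "called on all keys" means for the update function, here is
a minimal sketch (not the code from the unit test). The object name, the
function name, and the "count of batch intervals" state are hypothetical and
chosen only for illustration; the function just has the
(Seq[V], Option[S]) => Option[S] shape that updateStateByKey expects. For a key
with no data in a batch, the function receives an empty Seq and can still
advance the state:

```scala
object UpdateSketch {
  // Hypothetical state: how many batch intervals this key has been tracked.
  // The signature matches updateStateByKey's (Seq[V], Option[S]) => Option[S].
  def updateDuration(newEvents: Seq[String], state: Option[Long]): Option[Long] = {
    val previous = state.getOrElse(0L)
    // newEvents is empty when the key received no data in this batch;
    // the duration still advances by one batch interval.
    Some(previous + 1L)
  }

  def main(args: Array[String]): Unit = {
    // Simulates one batch in which the key received no events.
    println(updateDuration(Seq.empty, Some(5L))) // prints Some(6)
  }
}
```

Assuming a DStream[(String, String)] called keyedEvents (also a hypothetical
name), this would be wired in as
keyedEvents.updateStateByKey(UpdateSketch.updateDuration _).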

2. I am increasing the priority for this. Off the top of my head, this is
easy to fix, but hard to test reliably in a unit test. Will fix it
soon after the Spark 1.1 release.

TD


On Fri, Aug 1, 2014 at 7:37 AM, RodrigoB <rodrigo.boavida@aspect.com> wrote:

> Hi TD,
>
> I've also been fighting this issue only to find the exact same solution you
> are suggesting.
> Too bad I didn't find either the post or the issue sooner.
>
> I'm using a 1-second batch interval with N Kafka events (one-to-one with
> the state objects) per batch and only calling the updateStateByKey function.
>
> This is my interpretation, please correct me if needed:
> Because of Spark’s lazy computation, the RDDs weren’t being updated as
> expected on each batch interval. The assumption was that as long as I
> have a streaming batch run (with or without new messages), I should get
> updated RDDs, which was not happening. We only get updateStateByKey calls
> for objects which got events or that are forced to compute through an
> output function. I did not run further tests to confirm this, but that's
> the impression I got.
>
> This doesn't fit our requirements, as we want to do duration updates based
> on the batch interval execution... so I had to force the computation of all
> the objects through the foreachRDD function.
>
> I would also appreciate it if the priority of the issue could be increased.
> I assume the foreachRDD call is an additional, unnecessary resource
> allocation (although I'm not sure how much) as opposed to doing it somehow
> by default on batch interval execution.
>
> tnks,
> Rod
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/streaming-window-not-behaving-as-advertised-v1-0-1-tp10453p11168.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
