spark-user mailing list archives

From Tathagata Das <>
Subject Re: UpdateStateByKey - How to improve performance?
Date Wed, 06 Aug 2014 21:09:04 GMT
Hello Venkat,

Your thoughts are quite spot on. The current implementation was designed to
support timing out a state. For that to be possible, the update function
needs to be called for each key even when there is no new data, so that the
function can check things like the last update time and time itself out by
returning None as the state. However, in Spark 1.2 I plan to improve the
performance for scenarios such as yours.
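
As a concrete illustration of why the per-key call is needed, here is a
minimal sketch of a timeout-aware update function of the
(Seq[V], Option[S]) => Option[S] shape that updateStateByKey expects. The
TimedState case class, the sum-based merge, and the millisecond timestamps
are illustrative assumptions, not part of the Spark API:

```scala
// Hypothetical state carrying a running value plus the last-update timestamp.
case class TimedState(value: Long, lastUpdated: Long)

// Update function of the (Seq[V], Option[S]) => Option[S] shape.
// Returning None drops the key's state -- which is exactly why Spark must
// invoke this for every key on every batch, even keys with no new data:
// otherwise an idle key could never time itself out.
def updateWithTimeout(timeoutMs: Long, now: Long)
                     (newValues: Seq[Long], state: Option[TimedState]): Option[TimedState] = {
  if (newValues.nonEmpty) {
    // Fresh data: fold it into the running sum and refresh the timestamp.
    Some(TimedState(state.map(_.value).getOrElse(0L) + newValues.sum, now))
  } else {
    // No new data: keep the state only while it is younger than the timeout.
    state.filter(s => now - s.lastUpdated < timeoutMs)
  }
}
```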

In the meantime, you could also try other techniques for improving
performance (if you haven't already). For example, you can set the storage
level of the state DStream to a non-serialized level, which may improve
performance.
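
The storage-level change is a one-liner. This sketch assumes a
`stateDStream` produced by updateStateByKey; stateful DStreams default to a
serialized storage level, so storing the state deserialized trades memory
for less serialization work per batch:

```scala
import org.apache.spark.storage.StorageLevel

// Stateful DStreams default to StorageLevel.MEMORY_ONLY_SER; keeping the
// state deserialized avoids the per-batch ser/deser cost at the expense
// of a larger memory footprint.
stateDStream.persist(StorageLevel.MEMORY_ONLY)
```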


On Wed, Aug 6, 2014 at 1:29 PM, Venkat Subramanian <> wrote:

> The method
> def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) =>
> Option[S]): DStream[(K, S)]
> takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming.
> We have an input DStream[(K, V)] that has 40,000 elements. We update on
> average 1,000 of them in every 3-second batch, but based on how this
> updateStateByKey function is defined, we are looping through all 40,000
> elements (Seq[V]) to make an update for just 1,000 of them, leaving the
> other 39,000 untouched. I think looping through the extra 39,000 elements
> is a waste of performance.
> Isn't there a better way to do this efficiently, by building a hash map of
> just the 1,000 elements that need to be updated and updating only those
> (without looping through the unwanted elements)? Shouldn't there be a
> streaming update function that updates selective members, or are we
> missing some concepts here?
> I think updateStateByKey may be causing a lot of performance degradation
> in our app, as we keep doing this again and again for every batch. Please
> let us know if my thought process is correct here.
