spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Streaming Performance w/ UpdateStateByKey
Date Sat, 10 Oct 2015 12:28:33 GMT
How are you determining how much time serialization is taking?

I made this change in a streaming app that relies heavily on updateStateByKey. The memory
consumption went up 3x on the executors but I can't see any perf improvement. Task execution
time is the same and the serialization state metric in the spark UI is 0-1ms in both scenarios.

Any idea where else to look, or why I'm not seeing any performance uplift?

Thanks!
Adrian

Sent from my iPhone

On 06 Oct 2015, at 00:47, Tathagata Das <tdas@databricks.com> wrote:

You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the stateDStream returned by updateStateByKey
to achieve the same. As you have seen, the downside is greater memory usage, and also higher
GC overheads (that's usually the main one). So I suggest you run your benchmarks for long
enough to see what the GC overhead is. If it turns out that some batches randomly take
longer because a task on some executor is stuck in GC, then it's going to be bad.
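A minimal sketch of that override (the stream name `events`, its `(String, Long)` element type, and the counting update function are all illustrative assumptions, not from the thread):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

// `events` is assumed to be a DStream[(String, Long)] built elsewhere in the app.
def withDeserializedState(events: DStream[(String, Long)]): DStream[(String, Long)] = {
  // Simple running-count update function, purely for illustration.
  val updateFunc = (values: Seq[Long], state: Option[Long]) =>
    Some(values.sum + state.getOrElse(0L))

  val stateDStream = events.updateStateByKey(updateFunc)

  // StateDStream persists itself with MEMORY_ONLY_SER in its constructor;
  // calling persist again before ssc.start() overrides that level.
  stateDStream.persist(StorageLevel.MEMORY_ONLY)
  stateDStream
}
```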

Alternatively, you could also start playing with the CMS GC, etc.
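For example, a hypothetical submit command enabling CMS on the executors (the JAR name and the occupancy threshold are placeholders to tune for your cluster):

```shell
# Enable the CMS collector on executors to smooth out long GC pauses.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  your-streaming-app.jar
```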

BTW, it would be amazing if you could share the numbers from your benchmarks: the number of states,
how complex the objects in state are, what the processing time is, and what the improvements are.

TD


On Mon, Oct 5, 2015 at 2:28 PM, Jeff Nadler <jnadler@srcginc.com> wrote:

While investigating performance challenges in a Streaming application using UpdateStateByKey,
I found that serialization of state was a meaningful (not dominant) portion of our execution
time.

In StateDStream.scala, serialized persistence is required:

     super.persist(StorageLevel.MEMORY_ONLY_SER)

I can see why that might be a good choice for a default. For our workload, I made a clone
that uses StorageLevel.MEMORY_ONLY. I've just completed some tests and it is indeed faster,
with the expected cost of greater memory usage. For us that would be a good tradeoff.

I'm not taking any particular extra risks by doing this, am I?

Should this be configurable?  Perhaps yet another signature for PairDStreamFunctions.updateStateByKey?
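Such a signature might look like the following sketch; this is not part of Spark's actual API, just one way the storage level could be threaded through (here simulated as a helper that overrides the level after the fact):

```scala
import scala.reflect.ClassTag
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

// Hypothetical overload shape: a caller-supplied StorageLevel lets users trade
// extra memory for less serialization time, instead of the hard-coded
// MEMORY_ONLY_SER inside StateDStream.
def updateStateByKeyWithLevel[K: ClassTag, V: ClassTag, S: ClassTag](
    stream: DStream[(K, V)],
    updateFunc: (Seq[V], Option[S]) => Option[S],
    storageLevel: StorageLevel): DStream[(K, S)] = {
  val state = stream.updateStateByKey(updateFunc)
  state.persist(storageLevel) // replace the internal MEMORY_ONLY_SER default
  state
}
```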

Thanks for sharing any thoughts-

Jeff




