spark-user mailing list archives

From: Nikos Viorres
Subject: Re: updateStateByKey performance & API
Date: Wed, 18 Mar 2015 11:06:15 GMT
Hi Akhil,

Yes, that's what we are planning on doing at the end of the day. At the
moment I am doing performance testing before the job hits production,
running on 4 cores to get baseline figures. From those I deduced that in
order to grow to 10-15 million keys we'll need a batch interval of ~20 secs
if we don't want to allocate more than 8 cores to this job. The thing is
that, since we have a big "silent" window in user interactions during which
the stream will carry very little data, we would like to be able to use
these cores for batch processing during that window, but we can't the way it currently

best regards

On Wed, Mar 18, 2015 at 12:40 PM, Akhil Das wrote:

> You can always throw more machines at this and see whether performance
> improves, since you haven't mentioned anything regarding your # of cores etc.
> Thanks
> Best Regards
> On Wed, Mar 18, 2015 at 11:42 AM, nvrs wrote:
>> Hi all,
>> We are having a few issues with the performance of the updateStateByKey
>> operation in Spark Streaming (1.2.1 at the moment) and any advice would be
>> greatly appreciated. Specifically, on each tick of the system (which is set
>> at 10 secs) we need to update a state tuple where the key is the user_id
>> and the value is an object with some state about the user. The problem is
>> that, using Kryo serialization for 5M users, this gets really slow, to the
>> point that we have to increase the period to more than 10 seconds so as not
>> to fall behind.
>> The input for the streaming job is a Kafka stream which consists of
>> key-value pairs of user_ids with some sort of action codes; we join this to
>> our checkpointed state key and update the state.
>> I understand that the reason for iterating over the whole state set is for
>> evicting items or updating state for everyone for time-dependent
>> computations, but this does not apply to our situation and it hurts
>> performance really badly.
>> Is there a possibility of implementing in the future an extra call in the
>> API for updating only a specific subset of keys?
>> P.S. I will try as soon as possible to set the DStream as non-serialized,
>> but then I am worried about GC and checkpointing performance.
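
The difference between the two update styles discussed above can be
sketched in plain Python, without Spark. This is only a toy model under
stated assumptions: the per-user state is a hypothetical action counter,
and the names update_all_keys, update_touched_keys and count_actions are
illustrative and not part of any Spark API. The first function mimics the
updateStateByKey contract (the update function runs for every key held in
state, with an empty list when a key has no new data, and returning None
evicts the key); the second is the kind of subset-only update being asked
for.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]         # user_id -> hypothetical per-user state (a counter)
Batch = List[Tuple[str, int]]  # (user_id, action_code) pairs from one batch
UpdateFn = Callable[[List[int], Optional[int]], Optional[int]]


def group_by_key(batch: Batch) -> Dict[str, List[int]]:
    """Collect the action codes seen in this batch for each user_id."""
    grouped: Dict[str, List[int]] = {}
    for user_id, action in batch:
        grouped.setdefault(user_id, []).append(action)
    return grouped


def count_actions(new_values: List[int], old_state: Optional[int]) -> Optional[int]:
    """Hypothetical update function: keep a running count of actions per user."""
    return (old_state or 0) + len(new_values)


def update_all_keys(state: State, batch: Batch, update_fn: UpdateFn) -> Tuple[State, int]:
    """Model of the full-scan style: update_fn runs for every known key.

    Returns the new state and the number of update_fn calls made.
    """
    grouped = group_by_key(batch)
    new_state: State = {}
    calls = 0
    for key in state.keys() | grouped.keys():
        calls += 1
        result = update_fn(grouped.get(key, []), state.get(key))
        if result is not None:  # None means "evict this key"
            new_state[key] = result
    return new_state, calls


def update_touched_keys(state: State, batch: Batch, update_fn: UpdateFn) -> int:
    """Subset-only style: update_fn runs only for keys present in the batch.

    Mutates state in place and returns the number of update_fn calls made.
    """
    calls = 0
    for key, values in group_by_key(batch).items():
        calls += 1
        result = update_fn(values, state.get(key))
        if result is None:
            state.pop(key, None)  # evict on None, as in the full-scan model
        else:
            state[key] = result
    return calls
```

With three stored users and a batch that touches one existing user and one
new user, update_all_keys calls the update function four times while
update_touched_keys calls it twice, and both end with the same state. The
cost of the first grows with the total number of keys; the cost of the
second grows only with the batch size, which is the property that matters
when time-based eviction is not needed.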
