spark-user mailing list archives

From Vinoth Chandar <vin...@uber.com>
Subject Re: Size of arbitrary state managed via DStream updateStateByKey
Date Wed, 01 Apr 2015 19:35:13 GMT
Thanks for confirming!

On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das <tdas@databricks.com> wrote:

> In the current state, yes, there will be performance issues. It can be done
> much more efficiently, and we are working on that.
>
> TD
>
> On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar <vinoth@uber.com> wrote:
>
>> Hi all,
>>
>> As I understand from docs and talks, the streaming state is held in memory
>> as an RDD (optionally checkpointable to disk). SPARK-2629 hints that this
>> in-memory structure is not indexed efficiently?
>>
>> I am wondering how performance would be if the streaming state does not
>> fit in memory (say, 100GB of state over 10GB of total RAM) and I made random
>> updates to different keys via updateStateByKey. (Would throwing in SSDs
>> help?)
>>
>> I am picturing some kind of performance degradation akin to Linux/InnoDB
>> buffer cache thrashing. But if someone can demystify this, that would be
>> awesome.
>>
>> Thanks
>> Vinoth
>>
>>
>
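
For readers of this archived thread, here is a minimal plain-Python sketch (not Spark code) of why updateStateByKey can be costly with a large state. It assumes the documented behavior that the update function is invoked for every key already in the state on each batch, whether or not that key received new data. The names update_state_by_key and running_sum are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple


def update_state_by_key(
    state: Dict[Hashable, int],
    batch: List[Tuple[Hashable, int]],
    update_fn: Callable[[List[int], Optional[int]], Optional[int]],
) -> Tuple[Dict[Hashable, int], int]:
    """Model one micro-batch: return the new state and the number of update_fn calls."""
    # Group the batch's values by key, as a cogroup with the old state would.
    grouped: Dict[Hashable, List[int]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state: Dict[Hashable, int] = {}
    calls = 0
    # Every key in the old state is visited, not only the keys present in the batch.
    for key in set(state) | set(grouped):
        calls += 1
        result = update_fn(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state, calls


def running_sum(new_values: List[int], old: Optional[int]) -> Optional[int]:
    # Hypothetical update function: keep a running total per key.
    return (old or 0) + sum(new_values)


# 1000 keys of existing state, but the batch touches only two of them.
state = {f"user{i}": 1 for i in range(1000)}
batch = [("user3", 5), ("user7", 2), ("user3", 1)]
state, calls = update_state_by_key(state, batch, running_sum)
```

In this model the per-batch work grows with the total number of keys in the state rather than with the number of keys updated, so a state much larger than RAM would have to be read in full on every batch.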
