spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bin Wang <wbi...@gmail.com>
Subject Re: Get only updated RDDs from or after updateStateBykey
Date Thu, 24 Sep 2015 09:26:49 GMT
Data that are not updated should be saved earlier: while the data added to
the DStream at the first time, it should be considered as updated. So save
the same data again is a waste.

What are the community is doing? Is there any doc or discussion that I can
look for? Thanks.



Shixiong Zhu <zsxwing@gmail.com>于2015年9月24日周四 下午4:27写道:

> For data that are not updated, where do you save? Or do you only want to
> avoid accessing database for those that are not updated?
>
> Besides,  the community is working on optimizing "updateStateBykey"'s
> performance. Hope it will be delivered soon.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 13:45 GMT+08:00 Bin Wang <wbin00@gmail.com>:
>
>> I've read the source code and it seems to be impossible, but I'd like to
>> confirm it.
>>
>> It is a very useful feature. For example, I need to store the state of
>> DStream into my database, in order to recovery them from next redeploy. But
>> I only need to save the updated ones. Save all keys into database is a lot
>> of waste.
>>
>> Through the source code, I think it could be add easily: StateDStream can
>> get prevStateRDD so that it can make a diff. Is there any chance to add
>> this as an API of StateDStream? If so, I can work on this feature.
>>
>> If not possible, is there any work around or hack to do this by myself?
>>
>
>

Mime
View raw message