spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: [DISCUSS] Add RocksDB StateStore
Date Mon, 08 Feb 2021 18:39:32 GMT
Thank you, Liang-chi and all.

+1 for (2) external module design because it can deliver the new feature in
a safe way.

Bests,
Dongjoon

On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <jacek@japila.pl> wrote:

> Hi,
>
> I'm "okay to add RocksDB StateStore as external module". See no reason not
> to.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
>
>
> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <viirya@gmail.com> wrote:
>
>> Hi devs,
>>
>> In Spark structured streaming, we need state store for state management
>> for
>> stateful operators such streaming aggregates, joins, etc. We have one and
>> only one state store implementation now. It is in-memory hashmap which was
>> backed up in HDFS complaint file system at the end of every micro-batch.
>>
>> As it basically uses in-memory map to store states, memory consumption is
>> a
>> serious issue and state store size is limited by the size of the executor
>> memory. Moreover, state store using more memory means it may impact the
>> performance of task execution that requires memory too.
>>
>> Internally we see more streaming applications that requires large state in
>> stateful operations. For such requirements, we need a StateStore not rely
>> on
>> memory to store states.
>>
>> This seems to be also true externally as several other major streaming
>> frameworks already use RocksDB for state management. RocksDB is an
>> embedded
>> DB and streaming engines can use it to store state instead of memory
>> storage.
>>
>> So seems to me, it is proven to be good choice for large state usage. But
>> Spark SS still lacks of a built-in state store for the requirement.
>>
>> Previously there was one attempt SPARK-28120 to add RocksDB StateStore
>> into
>> Spark SS. IIUC, it was pushed back due to two concerns: extra code
>> maintenance cost and it introduces RocksDB dependency.
>>
>> For the first concern, as more users require to use the feature, it should
>> be highly used code in SS and more developers will look at it. For second
>> one, we propose (SPARK-34198) to add it as an external module to relieve
>> the
>> dependency concern.
>>
>> Because it was pushed back previously, I'm going to raise this discussion
>> to
>> know what people think about it now, in advance of submitting any code.
>>
>> I think there might be some possible opinions:
>>
>> 1. okay to add RocksDB StateStore into sql core module
>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>> 3. either 1 or 2 is okay
>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>> external module
>>
>> Please let us know if you have some thoughts.
>>
>> Thank you.
>>
>> Liang-Chi Hsieh
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>

Mime
View raw message