spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: [DISCUSS] Add RocksDB StateStore
Date Tue, 09 Feb 2021 07:02:24 GMT
+1 to add, no matter to add under sql-core vs external module.

Rationalization for myself:

* The discussion thread and voices here show strong demand for adding
RocksDB state store out of the box.
* No workaround on huge state store problem out of the box. Direct
competitors on streaming frameworks provide it for years.
* Maintenance cost is the major concern when evaluating to add something,
but it can't be applied here, as contributors/committers from various
companies are willing to contribute.
* Apache Bahir project is no longer something being maintained actively -
the last release was in September 2019 based on Spark 2.4.0. We can no
longer easily say "let's add to Bahir instead".



On Tue, Feb 9, 2021 at 3:22 PM Gabor Somogyi <gabor.g.somogyi@gmail.com>
wrote:

> +1 adding it any way.
>
> On Mon, 8 Feb 2021, 21:54 Holden Karau, <holden@pigscanfly.ca> wrote:
>
>> +1 for an external module.
>>
>> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su <chengsu@fb.com.invalid> wrote:
>>
>>> +1 for (2) adding to external module.
>>>
>>> I think this feature is useful and popular in practice, and option 2 is
>>> not conflict with previous concern for dependency.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> *From: *Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> *Date: *Monday, February 8, 2021 at 10:39 AM
>>> *To: *Jacek Laskowski <jacek@japila.pl>
>>> *Cc: *Liang-Chi Hsieh <viirya@gmail.com>, dev <dev@spark.apache.org>
>>> *Subject: *Re: [DISCUSS] Add RocksDB StateStore
>>>
>>>
>>>
>>> Thank you, Liang-chi and all.
>>>
>>>
>>>
>>> +1 for (2) external module design because it can deliver the new feature
>>> in a safe way.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon
>>>
>>>
>>>
>>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <jacek@japila.pl> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm "okay to add RocksDB StateStore as external module". See no reason
>>> not to.
>>>
>>>
>>> Pozdrawiam,
>>>
>>> Jacek Laskowski
>>>
>>> ----
>>>
>>> https://about.me/JacekLaskowski
>>>
>>> "The Internals Of" Online Books <https://books.japila.pl/>
>>>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>>
>>> <https://twitter.com/jaceklaskowski>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <viirya@gmail.com> wrote:
>>>
>>> Hi devs,
>>>
>>> In Spark structured streaming, we need state store for state management
>>> for
>>> stateful operators such streaming aggregates, joins, etc. We have one and
>>> only one state store implementation now. It is in-memory hashmap which
>>> was
>>> backed up in HDFS complaint file system at the end of every micro-batch.
>>>
>>> As it basically uses in-memory map to store states, memory consumption
>>> is a
>>> serious issue and state store size is limited by the size of the executor
>>> memory. Moreover, state store using more memory means it may impact the
>>> performance of task execution that requires memory too.
>>>
>>> Internally we see more streaming applications that requires large state
>>> in
>>> stateful operations. For such requirements, we need a StateStore not
>>> rely on
>>> memory to store states.
>>>
>>> This seems to be also true externally as several other major streaming
>>> frameworks already use RocksDB for state management. RocksDB is an
>>> embedded
>>> DB and streaming engines can use it to store state instead of memory
>>> storage.
>>>
>>> So seems to me, it is proven to be good choice for large state usage. But
>>> Spark SS still lacks of a built-in state store for the requirement.
>>>
>>> Previously there was one attempt SPARK-28120 to add RocksDB StateStore
>>> into
>>> Spark SS. IIUC, it was pushed back due to two concerns: extra code
>>> maintenance cost and it introduces RocksDB dependency.
>>>
>>> For the first concern, as more users require to use the feature, it
>>> should
>>> be highly used code in SS and more developers will look at it. For second
>>> one, we propose (SPARK-34198) to add it as an external module to relieve
>>> the
>>> dependency concern.
>>>
>>> Because it was pushed back previously, I'm going to raise this
>>> discussion to
>>> know what people think about it now, in advance of submitting any code.
>>>
>>> I think there might be some possible opinions:
>>>
>>> 1. okay to add RocksDB StateStore into sql core module
>>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>>> 3. either 1 or 2 is okay
>>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>>> external module
>>>
>>> Please let us know if you have some thoughts.
>>>
>>> Thank you.
>>>
>>> Liang-Chi Hsieh
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

Mime
View raw message