spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cheng Su <chen...@fb.com.INVALID>
Subject Re: [DISCUSS] Add RocksDB StateStore
Date Mon, 08 Feb 2021 19:51:05 GMT
+1 for (2) adding to external module.
I think this feature is useful and popular in practice, and option 2 is not conflict with
previous concern for dependency.

Thanks,
Cheng Su

From: Dongjoon Hyun <dongjoon.hyun@gmail.com>
Date: Monday, February 8, 2021 at 10:39 AM
To: Jacek Laskowski <jacek@japila.pl>
Cc: Liang-Chi Hsieh <viirya@gmail.com>, dev <dev@spark.apache.org>
Subject: Re: [DISCUSS] Add RocksDB StateStore

Thank you, Liang-chi and all.

+1 for (2) external module design because it can deliver the new feature in a safe way.

Bests,
Dongjoon

On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <jacek@japila.pl<mailto:jacek@japila.pl>>
wrote:
Hi,

I'm "okay to add RocksDB StateStore as external module". See no reason not to.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski<https://about.me/JacekLaskowski>
"The Internals Of" Online Books<https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski<https://twitter.com/jaceklaskowski>

<https://twitter.com/jaceklaskowski>


On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <viirya@gmail.com<mailto:viirya@gmail.com>>
wrote:
Hi devs,

In Spark structured streaming, we need state store for state management for
stateful operators such streaming aggregates, joins, etc. We have one and
only one state store implementation now. It is in-memory hashmap which was
backed up in HDFS complaint file system at the end of every micro-batch.

As it basically uses in-memory map to store states, memory consumption is a
serious issue and state store size is limited by the size of the executor
memory. Moreover, state store using more memory means it may impact the
performance of task execution that requires memory too.

Internally we see more streaming applications that requires large state in
stateful operations. For such requirements, we need a StateStore not rely on
memory to store states.

This seems to be also true externally as several other major streaming
frameworks already use RocksDB for state management. RocksDB is an embedded
DB and streaming engines can use it to store state instead of memory
storage.

So seems to me, it is proven to be good choice for large state usage. But
Spark SS still lacks of a built-in state store for the requirement.

Previously there was one attempt SPARK-28120 to add RocksDB StateStore into
Spark SS. IIUC, it was pushed back due to two concerns: extra code
maintenance cost and it introduces RocksDB dependency.

For the first concern, as more users require to use the feature, it should
be highly used code in SS and more developers will look at it. For second
one, we propose (SPARK-34198) to add it as an external module to relieve the
dependency concern.

Because it was pushed back previously, I'm going to raise this discussion to
know what people think about it now, in advance of submitting any code.

I think there might be some possible opinions:

1. okay to add RocksDB StateStore into sql core module
2. not okay for 1, but okay to add RocksDB StateStore as external module
3. either 1 or 2 is okay
4. not okay to add RocksDB StateStore, no matter into sql core or as
external module

Please let us know if you have some thoughts.

Thank you.

Liang-Chi Hsieh




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/<http://apache-spark-developers-list.1001551.n3.nabble.com/>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org<mailto:dev-unsubscribe@spark.apache.org>
Mime
View raw message