spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [DISCUSS] Add RocksDB StateStore
Date Tue, 09 Feb 2021 14:12:13 GMT
To clarify, I mean I am okay with adding it as an external module :-)

On Tue, Feb 9, 2021 at 11:10 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> I'm good with this too.
>
> On Tue, Feb 9, 2021 at 4:16 PM DB Tsai <dbtsai@dbtsai.com> wrote:
>
>> +1 to adding it as an external module so people can test it out and give
>> feedback more easily.
>>
>> On Mon, Feb 8, 2021 at 10:22 PM Gabor Somogyi <gabor.g.somogyi@gmail.com>
>> wrote:
>> >
>> > +1 for adding it either way.
>> >
>> > On Mon, 8 Feb 2021, 21:54 Holden Karau, <holden@pigscanfly.ca> wrote:
>> >>
>> >> +1 for an external module.
>> >>
>> >> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su <chengsu@fb.com.invalid> wrote:
>> >>>
>> >>> +1 for (2) adding to external module.
>> >>>
>> >>> I think this feature is useful and popular in practice, and option 2
>> >>> does not conflict with the previous concern about the dependency.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Cheng Su
>> >>>
>> >>>
>> >>>
>> >>> From: Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> >>> Date: Monday, February 8, 2021 at 10:39 AM
>> >>> To: Jacek Laskowski <jacek@japila.pl>
>> >>> Cc: Liang-Chi Hsieh <viirya@gmail.com>, dev <dev@spark.apache.org>
>> >>> Subject: Re: [DISCUSS] Add RocksDB StateStore
>> >>>
>> >>>
>> >>>
>> >>> Thank you, Liang-chi and all.
>> >>>
>> >>>
>> >>>
>> >>> +1 for (2) external module design because it can deliver the new
>> >>> feature in a safe way.
>> >>>
>> >>>
>> >>>
>> >>> Bests,
>> >>>
>> >>> Dongjoon
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <jacek@japila.pl> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>>
>> >>>
>> >>> I'm "okay to add RocksDB StateStore as external module". See no
>> >>> reason not to.
>> >>>
>> >>>
>> >>> Pozdrawiam,
>> >>>
>> >>> Jacek Laskowski
>> >>>
>> >>> ----
>> >>>
>> >>> https://about.me/JacekLaskowski
>> >>>
>> >>> "The Internals Of" Online Books
>> >>>
>> >>> Follow me on https://twitter.com/jaceklaskowski
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <viirya@gmail.com> wrote:
>> >>>
>> >>> Hi devs,
>> >>>
>> >>> In Spark Structured Streaming, we need a state store for state management
>> >>> for stateful operators such as streaming aggregates, joins, etc. We have
>> >>> one and only one state store implementation now. It is an in-memory hash
>> >>> map that is backed up to an HDFS-compatible file system at the end of
>> >>> every micro-batch.
>> >>>
>> >>> As it basically uses an in-memory map to store state, memory consumption
>> >>> is a serious issue, and the state store size is limited by the size of the
>> >>> executor memory. Moreover, a state store that uses more memory may hurt
>> >>> the performance of task execution, which requires memory too.
>> >>>
>> >>> Internally we see more streaming applications that require large state in
>> >>> stateful operations. For such requirements, we need a StateStore that does
>> >>> not rely on memory to store state.
>> >>>
>> >>> This also seems to be true externally, as several other major streaming
>> >>> frameworks already use RocksDB for state management. RocksDB is an
>> >>> embedded DB, and streaming engines can use it to store state instead of
>> >>> in-memory storage.
>> >>>
>> >>> So it seems to me that it has proven to be a good choice for large state
>> >>> usage. But Spark SS still lacks a built-in state store for this
>> >>> requirement.
>> >>>
>> >>> Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore
>> >>> to Spark SS. IIUC, it was pushed back due to two concerns: the extra code
>> >>> maintenance cost and the RocksDB dependency it introduces.
>> >>>
>> >>> For the first concern, as more users need the feature, it should become
>> >>> heavily used code in SS, and more developers will look at it. For the
>> >>> second one, we propose (SPARK-34198) to add it as an external module to
>> >>> relieve the dependency concern.
>> >>>
>> >>> Because it was pushed back previously, I'm raising this discussion to
>> >>> learn what people think about it now, in advance of submitting any code.
>> >>>
>> >>> I think there are a few possible opinions:
>> >>>
>> >>> 1. okay to add RocksDB StateStore into sql core module
>> >>> 2. not okay for 1, but okay to add RocksDB StateStore as external
>> >>>    module
>> >>> 3. either 1 or 2 is okay
>> >>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>> >>>    external module
>> >>>
>> >>> Please let us know if you have some thoughts.
>> >>>
>> >>> Thank you.
>> >>>
>> >>> Liang-Chi Hsieh
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>> >>
>> >>
>> >> --
>> >> Twitter: https://twitter.com/holdenkarau
>> >> Books (Learning Spark, High Performance Spark, etc.):
>> >> https://amzn.to/2MaRAG9
>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>> --
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>>
>>
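
For readers skimming the thread: the original proposal quoted above describes the existing state store as an in-memory hash map that is backed up to a file system at the end of every micro-batch. The toy Python sketch below only illustrates that pattern and why state size is bounded by memory. It is not Spark's implementation. The class `ToyInMemoryStateStore`, the helper `run_word_count`, and the snapshot file naming are all hypothetical, and a local temporary directory stands in for the HDFS-compatible file system.

```python
import json
import os
import tempfile


class ToyInMemoryStateStore:
    """Hypothetical toy model: all state lives in a dict, and every
    commit writes a full snapshot of that dict to a checkpoint directory."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.state = {}   # the whole state is held in memory
        self.version = 0  # number of committed micro-batches

    def put(self, key, value):
        self.state[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    def commit(self):
        """End of a micro-batch: persist the map under a new version."""
        self.version += 1
        path = os.path.join(self.checkpoint_dir, f"{self.version}.snapshot.json")
        with open(path, "w") as f:
            json.dump(self.state, f)
        return self.version

    @classmethod
    def restore(cls, checkpoint_dir, version):
        """Reload the state as it was after the given micro-batch."""
        store = cls(checkpoint_dir)
        path = os.path.join(checkpoint_dir, f"{version}.snapshot.json")
        with open(path) as f:
            store.state = json.load(f)
        store.version = version
        return store


def run_word_count(store, micro_batches):
    """A stateful running count: one commit per micro-batch."""
    for batch in micro_batches:
        for word in batch:
            store.put(word, store.get(word, 0) + 1)
        store.commit()
    return store


if __name__ == "__main__":
    checkpoint_dir = tempfile.mkdtemp()
    store = run_word_count(
        ToyInMemoryStateStore(checkpoint_dir),
        [["a", "b", "a"], ["b", "c"]],
    )
    print(store.version, store.state)
    print(ToyInMemoryStateStore.restore(checkpoint_dir, 1).state)
```

In this sketch every key stays in the dict for the lifetime of the store, so the state can never grow beyond what the process can hold in memory. That is the limitation the proposal aims to lift by keeping state in an embedded database such as RocksDB.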
