Nice I was looking for a jira. So I agree we should justify why we are building something. Now to that direction here is what I have seen from my experience.
People quite often use state within their streaming app and may have large states (TBs). Shortening the pipeline by not having to copy data (to Cassandra for example for serving) is an advantage, in terms of at least latency and complexity.
This can be true if we advantage of state checkpointing (locally could be RocksDB or in general HDFS the latter is currently supported) along with an API to efficiently query data.
Some use cases I see:
- real-time dashboards and real-time reporting, the faster the better
- monitoring of state for operational reasons, app health etc...
- integrating with external services via an API eg. making accessible aggregations over time windows to some third party service within your system
Regarding requirements here are some of them:
- support of an API to expose state (could be done at the spark driver), like rest.
- supporting dynamic allocation (not sure how it affects state management)
- an efficient way to talk to executors to get the state (rpc?)
- making local state more efficient and easier accessible with an embedded db (I dont think this is supported from what I see, maybe wrong)?