spark-dev mailing list archives

From Michael Armbrust
Subject Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)
Date Tue, 11 Oct 2016 17:55:25 GMT
This is super helpful, thanks for writing it up!

> *Delivering low latency, high throughput, and stability simultaneously:* Right
> now, our own tests indicate you can get at most two of these
> characteristics out of Spark Streaming at the same time. I know of two
> parties that have abandoned Spark Streaming because "pick any two" is not
> an acceptable answer to the latency/throughput/stability question for them.

Agree, this should be the major focus.

> *Complex event processing and state management:* Several groups I've
> talked to want to run a large number (tens or hundreds of thousands now,
> millions in the near future) of state machines over low-rate partitions of
> a high-rate stream. Covering these use cases translates roughly into
> three sub-requirements: maintaining lots of persistent state efficiently,
> feeding tuples to each state machine in the right order, and exposing
> convenient programmer APIs for complex event detection and signal
> processing tasks.

I've heard this one too, but don't know of anyone actively working on it.
Would be awesome to open a JIRA and start discussing what the APIs would
look like.
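
To make the discussion concrete, here is a rough sketch of the "many small
state machines, each fed in order" shape in plain Python. It is not a Spark
API or a proposal for one; the transition table, the event tuples, and
run_state_machines are all made up for illustration, and the state lives in
an in-memory dict rather than anything persistent.

```python
from collections import defaultdict

# Hypothetical state machine: IDLE -> ARMED -> FIRED.
# An "arm" event followed by a "trigger" event for the same key raises an alert.
TRANSITIONS = {
    ("IDLE", "arm"): "ARMED",
    ("ARMED", "trigger"): "FIRED",
    ("ARMED", "reset"): "IDLE",
}


def run_state_machines(events):
    """Run one state machine per key over (key, timestamp, symbol) tuples.

    Events may arrive in any order; each key's events are sorted by
    timestamp before being fed to that key's machine.
    """
    by_key = defaultdict(list)
    for key, ts, symbol in events:
        by_key[key].append((ts, symbol))

    state = {}   # final state per key (in memory only, an assumption)
    alerts = []  # (key, timestamp) for every detected arm -> trigger pattern
    for key, items in by_key.items():
        current = "IDLE"
        for ts, symbol in sorted(items):
            # Unknown (state, symbol) pairs leave the state unchanged.
            current = TRANSITIONS.get((current, symbol), current)
            if current == "FIRED":
                alerts.append((key, ts))
                current = "IDLE"
        state[key] = current
    return state, alerts


events = [
    ("sensor-1", 3, "trigger"),
    ("sensor-1", 1, "arm"),      # arrives late, but is ordered before the trigger
    ("sensor-2", 2, "trigger"),  # trigger without a prior arm: ignored
    ("sensor-2", 5, "arm"),
]
state, alerts = run_state_machines(events)
```

The interesting API questions are exactly the parts this sketch glosses over:
where the per-key state is stored, and how ordering is guaranteed when events
for a key span batches.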

> *Job graph scheduling and access to Dataset APIs:* These requirements
> come up in the context of groups who want to do streaming ETL. The general
> application profile that I've seen involves updating a large number of
> materialized views based on a smaller number of streams, using a mixture of
> SQL and nontraditional processing. The extended SQL that the Dataset APIs
> provide is useful in these applications. As for scheduling needs, it's
> common for multiple output tables to share intermediate computations. Users
> need an easy way to ensure that this shared computation happens only once,
> while controlling the peak memory utilization during each batch.

This sounds like two separate things to me.  High-level APIs (are streaming
DataFrames / Datasets missing anything?) and multi-query optimization for
streams.  I've been thinking about the latter.  I think we probably want to
crush latency/throughput/stability in the simpler case first, but after
that I think there is a lot of machinery already in SQL we can reuse (e.g.
the sameResult calculations used for caching).
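
As a toy illustration of the idea (not Catalyst code, and not how sameResult
is implemented), the sketch below canonicalizes small query plans so that
equivalent sub-plans map to the same cache key and are evaluated only once
per batch. The plan encoding, PREDICATES, canonical, and SharedExecutor are
all hypothetical names invented for this example.

```python
# Hypothetical named predicates that a "filter" node can refer to.
PREDICATES = {
    "positive": lambda row: row["v"] > 0,
    "even": lambda row: row["v"] % 2 == 0,
}


def canonical(plan):
    """Return a hashable canonical form of a toy plan.

    Plans are nested tuples: ("source", name), ("filter", predicates, child)
    or ("project", columns, child). Filter predicates are sorted so that
    plans differing only in predicate order get the same key.
    """
    op = plan[0]
    if op == "source":
        return plan
    if op == "filter":
        _, predicates, child = plan
        return ("filter", tuple(sorted(predicates)), canonical(child))
    if op == "project":
        _, columns, child = plan
        return ("project", tuple(columns), canonical(child))
    raise ValueError(f"unknown operator: {op}")


class SharedExecutor:
    """Evaluates plans for one batch, reusing results of equivalent sub-plans."""

    def __init__(self, sources):
        self.sources = sources  # source name -> list of row dicts
        self.cache = {}         # canonical plan -> result rows
        self.evaluations = 0    # number of plan nodes actually computed

    def run(self, plan):
        key = canonical(plan)
        if key in self.cache:
            return self.cache[key]
        self.evaluations += 1
        op = key[0]
        if op == "source":
            result = self.sources[key[1]]
        elif op == "filter":
            rows = self.run(key[2])
            result = [r for r in rows if all(PREDICATES[p](r) for p in key[1])]
        else:  # "project"
            rows = self.run(key[2])
            result = [{c: r[c] for c in key[1]} for r in rows]
        self.cache[key] = result
        return result


sources = {"events": [
    {"k": "a", "v": 2}, {"k": "b", "v": -4},
    {"k": "c", "v": 3}, {"k": "d", "v": 6},
]}
# Two output views whose filters differ only in predicate order.
view_a = ("project", ["k"], ("filter", ["positive", "even"], ("source", "events")))
view_b = ("project", ["v"], ("filter", ["even", "positive"], ("source", "events")))

executor = SharedExecutor(sources)
rows_a = executor.run(view_a)
rows_b = executor.run(view_b)
```

Here the second view reuses the cached filter result, so only four plan nodes
are computed instead of six. Bounding peak memory would mean deciding when to
drop entries from the cache, which this sketch does not attempt.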
