spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rodrigo Boavida (JIRA)" <>
Subject [jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming
Date Tue, 26 Jan 2016 15:40:40 GMT


Rodrigo Boavida commented on SPARK-2629:


I've experimented the new API method but am struggling to find an option to have the updateStateByKey
behavior which forces all elements to be recomputed on every batch (this is a requirement
for my application as I am calculating duration fields based on the streaming intervals and
updating external store for every element).

Seems this function does not allow to compute all elements as an option. Seems by design judging
by the design doc.

Could someone please clarify?


> Improved state management for Spark Streaming
> ---------------------------------------------
>                 Key: SPARK-2629
>                 URL:
>             Project: Spark
>          Issue Type: Epic
>          Components: Streaming
>    Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>  Current updateStateByKey provides stateful processing in Spark Streaming. It allows
the user to maintain per-key state and manage that state using an updateFunction. The updateFunction
is called for each key, and it uses new data and existing state of the key, to generate an
updated state. However, based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle data, (b)
returning items other than state
> The high level idea that I am proposing is 
> - Introduce a new API -trackStateByKey- *mapWithState* that, allows the user to update
per-key state, and emit arbitrary records. The new API is necessary as this will have significantly
different semantics than the existing updateStateByKey API. This API will have direct support
for timeouts.
> - Internally, the system will keep the state data as a map/list within the partitions
of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value
data, it will lookup the map/list in the state RDD partition and create a new list/map of
updated state data. The new state RDD partition will be created based on the update data and
if necessary, with old data. 
> Here is the detailed design doc (*outdated, to be updated*). Please take a look and provide
feedback as comments.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message