spark-issues mailing list archives

From "Tathagata Das (JIRA)" <>
Subject [jira] [Created] (SPARK-18124) Implement watermarking for handling late data
Date Wed, 26 Oct 2016 22:08:58 GMT
Tathagata Das created SPARK-18124:

             Summary: Implement watermarking for handling late data
                 Key: SPARK-18124
             Project: Spark
          Issue Type: Sub-task
            Reporter: Tathagata Das

Whenever we aggregate data by event time, we have to account for data that is late and out of order
in terms of its event time. Since we keep the aggregates, keyed by time, as state, the state
will grow without bound if we keep all old aggregates around in an attempt to handle arbitrarily
late data. Since the state is stored in memory, we have to prevent this unbounded state from
building up. Hence, we need a watermarking mechanism by which we mark data that is older than
a threshold as “too late” and stop updating the aggregates with it. This would allow
us to remove old aggregates that are never going to be updated, thus bounding the size of
the state.
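
To make the mechanism concrete, here is a minimal sketch in plain Python. It is not the Spark implementation or the design from the linked doc. It assumes integer event times, tumbling windows, and a watermark defined as the largest event time seen so far minus a lateness threshold. The class name WatermarkedWindowCounter and its parameters are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class WatermarkedWindowCounter:
    """Counts events per tumbling event-time window (hypothetical helper).

    Assumption: watermark = max event time seen so far - lateness_threshold.
    """

    def __init__(self, window_size, lateness_threshold):
        self.window_size = window_size
        self.lateness_threshold = lateness_threshold
        self.max_event_time = None       # largest event time seen so far
        self.counts = defaultdict(int)   # window start -> count (live state)
        self.finalized = {}              # stands in for a downstream sink

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lateness_threshold

    def process(self, event_time):
        """Return True if the event was counted, False if it was too late."""
        wm = self.watermark
        window_start = (event_time // self.window_size) * self.window_size
        # A window whose end is at or before the watermark is closed:
        # events for it are "too late" and no longer update the aggregate.
        if wm is not None and window_start + self.window_size <= wm:
            return False
        self.counts[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
            self._evict_closed_windows()
        return True

    def _evict_closed_windows(self):
        """Move windows behind the watermark out of the live state."""
        wm = self.watermark
        closed = [s for s in self.counts if s + self.window_size <= wm]
        for start in closed:
            self.finalized[start] = self.counts.pop(start)


counter = WatermarkedWindowCounter(window_size=10, lateness_threshold=5)
# Event 4 arrives out of order but within the threshold, so it is counted.
results = [counter.process(t) for t in [1, 3, 12, 4, 27]]
# Event 2 arrives after the watermark has passed its window, so it is dropped.
late_result = counter.process(2)
```

In this sketch the live state only ever holds windows that can still change, which is the bounding property described above. The choice of threshold trades completeness (how much late data is still accepted) against the amount of state kept.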

Here is the design doc -
