spark-issues mailing list archives

From "Hemant Bhanawat (JIRA)" <>
Subject [jira] [Created] (SPARK-24144) monotonically_increasing_id on streaming dataFrames
Date Wed, 02 May 2018 05:42:00 GMT
Hemant Bhanawat created SPARK-24144:

             Summary: monotonically_increasing_id on streaming dataFrames
                 Key: SPARK-24144
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Hemant Bhanawat

For our use case, we want to assign snapshot ids (incrementing counters) to the incoming records.
After a failure, the same record should receive the same id, so that the downstream DB can
handle the records correctly.

We were trying to do this by zipping the streaming RDDs with that counter using a modified
version of ZippedWithIndexRDD. There are other ways to do this, but all of them turn out to be
cumbersome and error-prone in failure scenarios.
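To illustrate why the zip-with-counter approach is fragile, here is a minimal pure-Python sketch (not Spark code; the partitions and micro-batches are plain lists) of what assigning consecutive ids across partitions and micro-batches involves: the starting offset of every batch must be carried over, and a re-run of a failed batch must reuse the same offset or the ids change.

```python
# Pure-Python sketch of zipWithIndex-style id assignment across partitions
# and micro-batches; hypothetical helper, not part of any Spark API.

def zip_with_index(partitions, start=0):
    """Assign consecutive ids across partitions, like RDD.zipWithIndex,
    but starting from a caller-supplied offset."""
    result = []
    offset = start
    for part in partitions:
        result.append([(record, offset + i) for i, record in enumerate(part)])
        offset += len(part)
    # The next offset must be persisted (checkpointed) by the caller,
    # and reused verbatim if the batch is replayed after a failure.
    return result, offset

# Micro-batch 1: two partitions.
batch1, next_off = zip_with_index([["a", "b"], ["c"]], start=0)
# Micro-batch 2 resumes from the checkpointed offset; replaying it with
# the same offset reproduces the same ids.
batch2, next_off = zip_with_index([["d", "e"]], start=next_off)
```

Keeping this bookkeeping correct by hand, outside the engine's own checkpointing, is exactly the error-prone part.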

As suggested on the Spark user/dev list, one way to do this would be to support monotonically_increasing_id
on streaming DataFrames in the Spark code base. This would ensure that the counters keep incrementing
across the records of the stream. Also, since the counter can be checkpointed, it would work
well in failure scenarios. Last but not least, doing this inside Spark would be the most
performant approach.
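For reference, on batch DataFrames monotonically_increasing_id produces ids that are increasing but not consecutive: per the Spark SQL documentation, the upper 31 bits hold the partition id and the lower 33 bits the record's position within its partition. The sketch below just reproduces that bit layout in plain Python to show the semantics the feature would extend to streams; `monotonic_id` is an illustrative helper, not a Spark function.

```python
# Illustrative reconstruction of the id layout used by Spark's
# monotonically_increasing_id in batch queries: partition id in the
# upper 31 bits, within-partition record index in the lower 33 bits.

def monotonic_id(partition_id, record_index):
    return (partition_id << 33) | record_index

# Ids increase across partitions but jump at partition boundaries:
first_in_part0 = monotonic_id(0, 0)   # 0
first_in_part1 = monotonic_id(1, 0)   # 8589934592, i.e. 1 << 33
```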


This message was sent by Atlassian JIRA
