spark-issues mailing list archives

From "Hemant Bhanawat (JIRA)" <>
Subject [jira] [Updated] (SPARK-24144) monotonically_increasing_id on streaming dataFrames
Date Wed, 02 May 2018 05:44:00 GMT


Hemant Bhanawat updated SPARK-24144:
    Priority: Major  (was: Minor)

> monotonically_increasing_id on streaming dataFrames
> ---------------------------------------------------
>                 Key: SPARK-24144
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Hemant Bhanawat
>            Priority: Major
> For our use case, we want to assign snapshot ids (incrementing counters) to the incoming
records. After a failure, the same record should get the same id again, so that
the downstream DB can handle the records correctly.
> We tried to do this by zipping the streaming RDDs with that counter, using a modified
version of ZippedWithIndexRDD. There are other ways to do this, but all of them turn out to be
cumbersome and error-prone in failure scenarios.
> As suggested on the Spark user/dev list, one way to do this would be to support monotonically_increasing_id
on streaming DataFrames in the Spark code base. This would ensure that the counters keep incrementing
across the records of the stream. Also, since the counter can be checkpointed, it would work
well in failure scenarios. Last but not least, doing this inside Spark would be the
most performant approach.
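
The requirement described above (incrementing ids that stay the same when a batch is replayed after a failure) can be illustrated with a small sketch. This is plain Python, not Spark code and not the proposed implementation: `CounterCheckpoint` and `assign_ids` are hypothetical names, the checkpoint is an in-memory stand-in for durable storage, and the sketch assumes a replayable source that delivers the records of a given batch in the same order on every attempt.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CounterCheckpoint:
    """Hypothetical stand-in for a durable checkpoint of the id counter."""
    # batch id -> first id handed out in that batch
    start_ids: Dict[int, int] = field(default_factory=dict)
    # next unused id across the whole stream
    next_id: int = 0


def assign_ids(batch_id: int, records: List[str],
               checkpoint: CounterCheckpoint) -> List[Tuple[int, str]]:
    """Zip each record with an id that is stable across replays of the batch."""
    if batch_id not in checkpoint.start_ids:
        # First attempt at this batch: reserve a contiguous id range for it.
        checkpoint.start_ids[batch_id] = checkpoint.next_id
        checkpoint.next_id += len(records)
    # On a replay, the recorded start id is reused, so ids do not change.
    start = checkpoint.start_ids[batch_id]
    return [(start + i, record) for i, record in enumerate(records)]


checkpoint = CounterCheckpoint()
first = assign_ids(0, ["a", "b", "c"], checkpoint)
second = assign_ids(1, ["d", "e"], checkpoint)
# Simulate a failure after batch 1 was planned: the batch is processed again.
replayed = assign_ids(1, ["d", "e"], checkpoint)
```

In this sketch `replayed` equals `second`, which is the property the downstream DB relies on. The stability comes entirely from recording each batch's start id before the batch is processed; if the source could reorder records within a replayed batch, the ids would no longer line up with the same records.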
