Why do you want to start the new code in parallel to the old one? Why not stop the old one, and then start the new one? Structured Streaming ensures that all checkpoint information (offsets and state) are future-compatible (as long as state schema is unchanged), hence new code should be able to pick exactly where the old code left off. 


On Wed, Mar 21, 2018 at 11:56 AM, Priyank Shrivastava <priyank@asperasoft.com> wrote:
I am using Structured Streaming with Spark 2.2.  We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees.  I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.

As some of the sources suggest I can start the updated application parallelly with the old application until it catches up with the old application in terms of data, and then kill the old one.  But then the new application will have to re-read/re-process all the data in kafka which could take a long time.

I want to AVOID this re-processing of the data in the newly deployed updated application.

One way I can think of is for the application to keep writing the offsets into something in addition to the checkpoint directory, for example in zookeeper/hdfs.  And then, on an update of the application, I command Kafka readstream() to start reading from the offsets stored in this new location (zookeeper/hdfs) - since the updated application can't read from the checkpoint directory which is now deemed incompatible.

So a couple of questions:  
1.  Is the above-stated solution a valid solution?
2.  If yes, How can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?

Any guidance, example or information source is appreciated.