hi there,

I noticed the 0.9 release announces exactly-once semantics for streams. I looked at the user guide and the primary mechanism for recovery appears to be checkpointing of user state. I have a few questions:

1. The default behavior is that checkpoints are kept in memory on the JobManager. Am I correct in assuming that this does *not* guarantee failure recovery or exactly-once semantics if the driver fails?

2 The alternative recommended approach is to checkpoint to HDFS. If I have a program where I am doing something like aggregating counts over thousands of keys... it doesn't seem tenable to save a huge number checkpoint files to HDFS with acceptable latency.

Could you comment at all on the persistence model for exactly-once in Flink. I am pretty confused because checkpointing to HDFS in this way seems to have limitations around scalability and latency.

- nate