What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data?
I'm concerned about the added complexity and performance overhead of writing large amounts of data into HDFS at a small batch interval.
I think the whole solution is well designed and thought out, but I'm not sure it really fits all needs with checkpoint-based technologies like Flume or Kafka, since it hides the management of the offset from the user code.
If, instead of saving the received data into HDFS, the ReceiverHandler saved some metadata specified by the custom receiver passed into the StreamingContext (such as the offset, in the case of Kafka), then upon driver restart that metadata could be used by the custom receiver to recover the point from which it should resume receiving data.
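To make the metadata idea concrete, here is a minimal, hypothetical sketch of that scheme; it is not Spark's actual ReceiverHandler API, and the names `OffsetStore` and `ReplayableReceiver` are purely illustrative. The point is that only the last acknowledged offset is persisted, and a replayable source like Kafka serves the data again from that point after a restart.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical metadata store: persists only the last acknowledged offset,
// not the received data itself (the source can replay from that offset).
class OffsetStore(path: Path) {
  def save(offset: Long): Unit =
    Files.write(path, offset.toString.getBytes(StandardCharsets.UTF_8))

  def load(): Option[Long] =
    if (Files.exists(path) && Files.size(path) > 0)
      Some(new String(Files.readAllBytes(path), StandardCharsets.UTF_8).toLong)
    else None
}

// Hypothetical receiver: on (re)start it asks the store where to resume,
// and acknowledges an offset only after the corresponding data is processed.
class ReplayableReceiver(store: OffsetStore) {
  def startingOffset: Long = store.load().getOrElse(0L)
  def ack(offset: Long): Unit = store.save(offset)
}

object Demo extends App {
  val path = Files.createTempFile("offsets", ".txt")
  val store = new OffsetStore(path)

  // First run: process data up to offset 42, then the driver "crashes".
  val r1 = new ReplayableReceiver(store)
  r1.ack(42L)

  // Driver restart: a fresh receiver recovers the resume point from metadata.
  val r2 = new ReplayableReceiver(store)
  println(r2.startingOffset) // 42
}
```

Compared with writing every received block to HDFS, this only works for sources that can replay from an offset, which is exactly the restriction being debated above.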
Any comments would be greatly appreciated.