spark-user mailing list archives

From craigv <>
Subject Use Spark Streaming for Batch?
Date Sat, 21 Feb 2015 01:05:00 GMT
We have a sophisticated Spark Streaming application that we have been using
successfully in production for over a year to process a time series of
events.  Our application makes novel use of updateStateByKey() for state
management.

We now have the need to perform exactly the same processing on input data
that's not real-time, but has been persisted to disk.  We do not want to
rewrite our Spark Streaming app unless we have to.

/Might it be possible to perform "large batch" processing of time series
data stored on HDFS using Spark Streaming?/

1. I understand that there is not currently an InputDStream that could do
what's needed.  I would have to create such a thing.
2. Time is a problem.  I would have to use the timestamps on our events for
any time-based logic and state management.
3. The "batch duration" would become meaningless in this scenario.  Could I
just set it to something really small (say 1 second) and then let it "fall
behind", processing the data as quickly as it could?
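
To make points 2 and 3 concrete, here is a rough sketch in plain Python (no
Spark involved) of what I have in mind.  The event tuples, BATCH_SECONDS and
all function names are made up for illustration.  Persisted events are grouped
into windows by their own timestamps, and each window is folded into per-key
state by a function shaped like the one we pass to updateStateByKey().  Unlike
a real streaming job, this sketch skips empty windows instead of running empty
batches.

```python
from collections import defaultdict

# Hypothetical persisted events: (event_time_seconds, key, value).
EVENTS = [
    (0.2, "a", 1),
    (0.9, "b", 5),
    (1.1, "a", 2),
    (3.4, "a", 4),
    (3.7, "b", 1),
]

BATCH_SECONDS = 1  # stands in for the "batch duration", measured in event time


def batches_by_event_time(events, batch_seconds):
    """Yield (window_start, {key: [values]}) in event-time order.

    Windows that contain no events are skipped.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for ts, key, value in events:
        window = int(ts // batch_seconds) * batch_seconds
        grouped[window][key].append(value)
    for window in sorted(grouped):
        yield window, dict(grouped[window])


def update_state(new_values, running_total):
    """Shaped like an updateStateByKey() update function: new values, old state."""
    return sum(new_values) + (running_total or 0)


def replay(events, batch_seconds):
    """Process every batch back to back, never waiting on the wall clock."""
    state = {}
    for _, per_key in batches_by_event_time(events, batch_seconds):
        # Visit every known key, including keys with no new values in this batch.
        for key in set(state) | set(per_key):
            state[key] = update_state(per_key.get(key, []), state.get(key))
    return state


final_state = replay(EVENTS, BATCH_SECONDS)
```

The point of the sketch is that nothing in the state update depends on the
wall clock, only on the order of the event-time windows.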

It all seems possible.  But could Spark Streaming work this way?  If I
created a DStream that delivered (say) months of events, could Spark
Streaming effectively process this in a "batch" fashion?

Any and all comments/ideas welcome!
