spark-user mailing list archives

From Sean Owen <>
Subject Re: Use Spark Streaming for Batch?
Date Sat, 21 Feb 2015 09:42:53 GMT
I agree with your assessment as to why it *doesn't* just work. I don't
think a small batch duration helps, as all the files it sees at the
outset are processed in one batch. Your timestamps are a user-space
concept, not a framework concept.

However, there ought to be a great deal of reusability between the
two, so maybe a small refactoring lets you use 95% of it as-is.

Isn't the core of your job to process an RDD of timestamp+data
together with state to produce new state? If you have the pieces to do
that, you should be able to hook them into Spark Streaming, using its
timestamp value and its updateStateByKey, and then just as easily
point the same generic logic at an RDD of historical data and an empty
initial state.
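
As a rough illustration of that split, here is a minimal sketch in plain
Python, so it runs without a cluster. Everything named here is hypothetical
and not taken from your application: update_state, run_batch, the
(timestamp, value) event shape, the (last timestamp, running total) state
and the window size are all assumptions. The only point is that the state
logic lives in one function with the (new values, previous state) shape
that updateStateByKey expects, and the batch path replays history in
event-time order starting from empty state.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[int, float]   # (event timestamp, value); hypothetical shape
State = Tuple[int, float]   # (latest timestamp seen, running total); hypothetical


def update_state(new_events: List[Event], state: Optional[State]) -> State:
    """Fold one batch of events for a single key into that key's state.

    Takes (new values, previous state), the same argument shape as an
    update function passed to updateStateByKey.
    """
    last_ts, total = state if state is not None else (0, 0.0)
    for ts, value in sorted(new_events):  # apply events in event-time order
        last_ts = max(last_ts, ts)
        total += value
    return (last_ts, total)


def run_batch(records: Iterable[Tuple[str, Event]], window: int) -> Dict[str, State]:
    """Replay historical (key, event) records from an empty initial state.

    Records are bucketed by event timestamp, so the user-space timestamp
    plays the role that batch time plays in the streaming job.
    """
    buckets: Dict[int, Dict[str, List[Event]]] = defaultdict(lambda: defaultdict(list))
    for key, (ts, value) in records:
        buckets[ts // window][key].append((ts, value))

    state: Dict[str, State] = {}
    for bucket in sorted(buckets):  # oldest window first
        for key, events in buckets[bucket].items():
            state[key] = update_state(events, state.get(key))
    return state


history = [("a", (5, 1.0)), ("b", (7, 2.0)), ("a", (65, 3.0))]
final_state = run_batch(history, window=60)
print(final_state)
```

In the streaming job the same update_state would be what gets passed to
updateStateByKey; in the batch job the grouping done by run_batch would
instead be done on an RDD loaded from the persisted data.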

On Sat, Feb 21, 2015 at 1:05 AM, craigv <> wrote:
> We have a sophisticated Spark Streaming application that we have been using
> successfully in production for over a year to process a time series of
> events.  Our application makes novel use of updateStateByKey() for state
> management.
> We now have the need to perform exactly the same processing on input data
> that's not real-time, but has been persisted to disk.  We do not want to
> rewrite our Spark Streaming app unless we have to.
> /Might it be possible to perform "large batches" processing on HDFS time
> series data using Spark Streaming?/
> 1. I understand that there is not currently an InputDStream that could do
> what's needed.  I would have to create such a thing.
> 2. Time is a problem.  I would have to use the timestamps on our events for
> any time-based logic and state management.
> 3. The "batch duration" would become meaningless in this scenario.  Could I
> just set it to something really small (say 1 second) and then let it "fall
> behind", processing the data as quickly as it could?
> It all seems possible.  But could Spark Streaming work this way?  If I
> created a DStream that delivered (say) months of events, could Spark
> Streaming effectively process this in a "batch" fashion?
> Any and all comments/ideas welcome!
> --
> View this message in context:
> Sent from the Apache Spark User List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
