spark-user mailing list archives

From Ian Holsman <>
Subject controlling the time in spark-streaming
Date Thu, 22 May 2014 15:38:37 GMT

I'm writing a pilot project, and plan on using spark's streaming app for it.

To start with I have a dump of some access logs with their own timestamps,
and am using the textFileStream and some old files to test it with.

One of the issues I've come across is simulating the windows. I would like
to use the timestamp from the access logs as the 'system time' instead of
the real clock time.

I googled a bit and found the 'manual' clock, which appears to be used for
testing the job scheduler, but I'm not sure what my next steps should be.
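For reference, Spark's own streaming tests appear to select the manual clock through an internal, undocumented configuration property (the property name and class path below are taken from my reading of the Spark 1.x sources and may change between releases):

```
# Internal/undocumented -- used by Spark's streaming test suite, not a public API
spark.streaming.clock = org.apache.spark.streaming.util.ManualClock
```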

I'm guessing I'll need to do something like:

1. use textFileStream to create a DStream
2. have some kind of DStream on top of that which creates the RDDs based
on the log timestamps instead of the system time
3. run the rest of my mappers on top of that.
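Step 2 above boils down to keying each record by its own timestamp and grouping records into fixed-width buckets. A minimal sketch of that logic in plain Python (the function name and window width are illustrative, not a Spark API):

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_by_event_time(records, window_seconds):
    """Group (timestamp, line) pairs into fixed windows keyed by the
    log's own timestamp rather than the wall clock."""
    buckets = defaultdict(list)
    for ts, line in records:
        epoch = int(ts.timestamp())
        # Align each record to the start of its window.
        window_start = epoch - (epoch % window_seconds)
        buckets[window_start].append(line)
    return dict(buckets)

# Example access-log records with embedded timestamps.
logs = [
    (datetime(2014, 5, 22, 15, 0, 5, tzinfo=timezone.utc), "GET /a"),
    (datetime(2014, 5, 22, 15, 0, 55, tzinfo=timezone.utc), "GET /b"),
    (datetime(2014, 5, 22, 15, 1, 10, tzinfo=timezone.utc), "GET /c"),
]
windows = bucket_by_event_time(logs, window_seconds=60)
# The first two records share the 15:00 window; the third lands in 15:01.
```

In a Spark job the same grouping could be expressed as a keyBy on the parsed timestamp followed by a groupByKey, without touching the streaming clock at all.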

Is this correct? Or do I need to create my own 'textFileStream' that
creates the RDDs itself and manipulates the system clock inside?

I'm not too concerned about out-of-order messages, going backwards in time,
or being 100% in sync across workers, as this is more of a pilot.
Are there better ways of achieving this? I would assume that controlling
the windows' RDD buckets is a common use case.


Ian Holsman
PH: + 61-3-9028 8133 / +1-(425) 998-7083
