spark-user mailing list archives

From Tobias Pfeiffer <>
Subject Re: Use Spark Streaming for Batch?
Date Mon, 23 Feb 2015 01:21:07 GMT

On Sat, Feb 21, 2015 at 1:05 AM, craigv <> wrote:
> /Might it be possible to perform "large batches" processing on HDFS time
> series data using Spark Streaming?/
>
> 1. I understand that there is not currently an InputDStream that could do
> what's needed. I would have to create such a thing.
> 2. Time is a problem. I would have to use the timestamps on our events for
> any time-based logic and state management.
> 3. The "batch duration" would become meaningless in this scenario. Could I
> just set it to something really small (say 1 second) and then let it "fall
> behind", processing the data as quickly as it could?

So, if it is not an issue for you that everything is processed in one batch,
you can use streamingContext.textFileStream(hdfsDirectory). This will
create a DStream whose first batch holds one huge RDD with all the data,
followed by empty batches afterwards. Using a small batch size should not
be a problem.
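A minimal sketch of that first option (the HDFS path, app name, and batch interval below are placeholders, not anything from the original question):

```scala
// Sketch: read a whole HDFS directory through Spark Streaming in one batch.
// Assumes Spark Streaming is on the classpath; path/interval are made up.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchViaTextFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchViaTextFileStream")
    val ssc  = new StreamingContext(conf, Seconds(1)) // small batch duration

    // Existing files arrive in the first batch; later batches are empty
    // unless new files land in the directory.
    val lines = ssc.textFileStream("hdfs:///path/to/timeseries")
    lines.foreachRDD { rdd =>
      val n = rdd.count()
      if (n > 0) println(s"Processing $n lines")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```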
An alternative would be to write some code that creates one RDD per file in
your HDFS directory, put those RDDs into a Queue, and then use
streamingContext.queueStream(), possibly with the oneAtATime=true parameter
(which processes only one RDD per batch).
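The queue-based alternative could look roughly like this (the directory listing via the Hadoop FileSystem API and the paths are assumptions for illustration):

```scala
// Sketch: one RDD per HDFS file, fed through queueStream one per batch.
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.fs.{FileSystem, Path}

object BatchViaQueueStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchViaQueueStream")
    val ssc  = new StreamingContext(conf, Seconds(1))
    val sc   = ssc.sparkContext

    // List the files in the directory and build one RDD per file.
    val fs    = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///path/to/timeseries"))
                  .map(_.getPath.toString)
    val queue = mutable.Queue(files.map(f => sc.textFile(f): RDD[String]): _*)

    // oneAtATime = true: each batch dequeues and processes a single RDD,
    // i.e. one file per batch.
    val stream = ssc.queueStream(queue, oneAtATime = true)
    stream.foreachRDD { rdd => println(s"Lines in this batch: ${rdd.count()}") }

    ssc.start()
    ssc.awaitTermination()
  }
}
```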

However, doing window computations etc. with the timestamps embedded *in*
your data would be a major effort, as in: you cannot use the existing
windowing functionality from Spark Streaming, which windows by batch
(processing) time, not by event time. If you want to read more about that,
there have been a number of discussions about that topic on this list;
maybe you can look them up.
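For completeness, one hand-rolled approach is to bucket records into fixed windows keyed by the timestamp carried in each record, using plain RDD operations. The record shape (Event) and window size here are entirely made up for illustration:

```scala
// Hypothetical sketch: fixed, non-overlapping event-time windows built
// with ordinary RDD operations, not Spark Streaming's window() methods.
import org.apache.spark.rdd.RDD

case class Event(timestampMs: Long, value: Double) // assumed record shape

def bucketByEventTime(events: RDD[Event],
                      windowMs: Long): RDD[(Long, Iterable[Event])] = {
  // Window key = start of the window the event's own timestamp falls into,
  // e.g. with windowMs = 1000, timestamp 12345 maps to window start 12000.
  events
    .map(e => ((e.timestampMs / windowMs) * windowMs, e))
    .groupByKey()
}
```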

