spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sanjay Awatramani <sanjay_a...@yahoo.com>
Subject Relation between DStream and RDDs
Date Thu, 20 Mar 2014 05:20:52 GMT
Hi,

As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will run a given func
on each and every RDD inside a DStream.

I created a simple program which reads log files from a folder every hour:
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60 * 1000));
//1 hour
JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");

When the interval is reached, Spark reads all the files and creates one and only one RDD (as
i verified from a sysout inside foreachRDD).

The streaming doc at a lot of places gives an indication that many operations (e.g. flatMap)
on a DStream are applied individually to a RDD and the resulting DStream consists of the mapped
RDDs in the same number as the input DStream.
ref: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams

If that is the case, how can i generate a scenario where in I have multiple RDDs inside a
DStream in my example ?

Regards,
Sanjay
Mime
View raw message