spark-user mailing list archives

From "deenar.toraskar" <>
Subject Re: converting DStream[String] into RDD[String] in spark streaming
Date Sun, 22 Mar 2015 08:43:42 GMT

DStream.saveAsTextFiles internally calls foreachRDD, and saveAsTextFile on
each interval's RDD:

  def saveAsTextFiles(prefix: String, suffix: String = "") {
    val saveFunc = (rdd: RDD[T], time: Time) => {
      val file = rddToFileName(prefix, suffix, time)
      rdd.saveAsTextFile(file)
    }
    this.foreachRDD(saveFunc)
  }

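For reference, the per-interval path is built from the prefix, the batch time in milliseconds, and an optional suffix. A minimal pure-Scala sketch of that naming scheme (reconstructed here to mirror Spark's internal rddToFileName; not the literal Spark source):

```scala
// Sketch of the per-interval output naming used by saveAsTextFiles:
// each batch's RDD is written to a directory "<prefix>-<batchTimeMs>[.<suffix>]".
def rddToFileName(prefix: String, suffix: String, timeMs: Long): String =
  if (suffix.isEmpty) s"$prefix-$timeMs" else s"$prefix-$timeMs.$suffix"

// With a 30-second batch interval, consecutive batches land in, e.g.,
// twitterRawJSON-1427013600000, twitterRawJSON-1427013630000, ...
```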
    val sparkConf = new SparkConf().setAppName("TwitterRawJSON")
    val ssc = new StreamingContext(sparkConf, Seconds(30))

1) If there are no sliding window calls in this streaming context, will
there be just one file written per interval?
2) If there is a sliding window call in the same context, such as

    val hashTags = stream.flatMap(json => ...)  // body elided in the original
    val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
                              .map { case (topic, count) => (count, topic) }

will some files get written multiple times (whenever an interval falls
inside more than one window)?


>> DStream.foreachRDD gives you an RDD[String] for each interval, of
>> course. I don't think it makes sense to say a DStream can be converted
>> into one RDD, since it is a stream. The past elements are inherently
>> not supposed to stick around for a long time, and future elements
>> aren't known. You may consider saving each RDD[String] to HDFS, and
>> then simply loading it from HDFS as an RDD[String].
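The suggestion quoted above can be sketched as follows (this needs a running Spark application; the HDFS path is hypothetical, and `stream: DStream[String]` and `sc: SparkContext` are assumed to come from the surrounding program):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Save each interval's RDD[String] under a hypothetical HDFS prefix,
// one directory per batch time:
stream.foreachRDD { (rdd: RDD[String], time: Time) =>
  rdd.saveAsTextFile(s"hdfs:///tmp/twitterRawJSON-${time.milliseconds}")
}

// Later (e.g. in a separate batch job), load all intervals back as one
// RDD[String] with a glob over the saved directories:
val allTweets = sc.textFile("hdfs:///tmp/twitterRawJSON-*")
```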
