spark-user mailing list archives

From Diana Carroll <dcarr...@cloudera.com>
Subject Re: streaming questions
Date Wed, 26 Mar 2014 19:18:36 GMT
Thanks, Tathagata, very helpful. A follow-up question below...


On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das
<tathagata.das1565@gmail.com> wrote:

>
>
> *Answer 3:* You can do something like
> wordCounts.foreachRDD((rdd: RDD[...], time: Time) => {
>    if (rdd.take(1).size == 1) {
>       // There exists at least one element in RDD, so save it to file
>       rdd.saveAsTextFile(<generate file name based on time>)
>    }
> })
>

Is calling foreachRDD and performing an operation on each RDD individually as
efficient as performing the operation on the DStream? Is this foreachRDD
pretty much what dstream.saveAsTextFiles is doing anyway?

This also brings up a question I have about caching in the context of
streaming. In this example, would I want to call rdd.cache()? I'm
calling two successive operations on the same RDD (take(1) and then
saveAsTextFile)... if I were doing this in regular Spark I'd want to cache
so I wouldn't need to recompute the RDD for both calls. Does the same
apply here?
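
To make the caching question concrete, here is roughly what I have in mind.
This is only an untested sketch: the (String, Int) element type, the
SaveNonEmpty object, the saveNonEmptyBatches helper, and the outputPrefix
parameter are my own assumptions for illustration, not something from your
answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

object SaveNonEmpty {
  // Hypothetical helper: the element type and outputPrefix are assumed for illustration.
  def saveNonEmptyBatches(wordCounts: DStream[(String, Int)], outputPrefix: String): Unit = {
    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      // Mark the batch RDD for caching so take(1) and saveAsTextFile
      // could share one computation (whether this helps is my question).
      rdd.cache()
      if (rdd.take(1).length == 1) {
        // At least one element exists, so write this batch under a time-based name.
        rdd.saveAsTextFile(s"$outputPrefix-${time.milliseconds}")
      }
      // Release the cached blocks once this batch has been handled.
      rdd.unpersist()
    })
  }
}
```

I'm not sure whether the cache()/unpersist() pair buys anything when only
take(1) precedes the save, which is really what I'm asking.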

Thanks,
Diana
