spark-user mailing list archives

From Gerard Maas <gerard.m...@gmail.com>
Subject Re: Transforming the Dstream vs transforming each RDDs in the Dstream.
Date Mon, 20 Oct 2014 09:07:29 GMT
Pinging TD -- I'm sure you know :-)

-kr, Gerard.

On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi,
>
> We have been implementing several Spark Streaming jobs that basically
> process data and insert it into Cassandra, routing records to different
> keyspaces.
>
> We've been following the pattern:
>
> dstream.foreachRDD { rdd =>
>   val records = rdd.map(elem => record(elem))
>   targets.foreach { target =>
>     records.filter(record => isTarget(target, record))
>            .writeToCassandra(target, table)
>   }
> }
>
> I've been wondering whether there would be a performance difference in
> transforming the dstream instead of transforming the RDD within the dstream
> with regards to how the transformations get scheduled.
>
> Instead of the RDD-centric computation, I could transform the dstream
> until the last step, where I need an RDD to store.
> For example, the previous transformation could be written as:
>
> val recordStream = dstream.map(elem => record(elem))
> targets.foreach { target =>
>   recordStream.filter(record => isTarget(target, record))
>               .foreachRDD(_.writeToCassandra(target, table))
> }
>
> Would there be a difference in execution and/or performance?  What would
> be the preferred way to do this?
>
> Bonus question: Is there a better (more performant) way to sort the data
> into different "buckets", instead of filtering the whole data collection
> once per bucket (#buckets passes)?
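>
> For illustration, one rough sketch of an alternative (untested; same
> job, assuming targets are serializable and at most one target matches
> each record): tag each record with its target once and cache the tagged
> RDD, so the isTarget check runs once per record instead of once per
> (record, target) pair:
>
> dstream.foreachRDD { rdd =>
>   // Tag each record with the first matching target, once.
>   val tagged = rdd.map { elem =>
>     val rec = record(elem)
>     (targets.find(t => isTarget(t, rec)), rec)
>   }.cache()  // cached so the per-target filters below reuse it
>   targets.foreach { target =>
>     tagged.filter(_._1 == Some(target))  // cheap equality test
>           .map(_._2)
>           .writeToCassandra(target, table)
>   }
>   tagged.unpersist()
> }
>
> This still issues one write per target, but the per-target filter is a
> cheap key comparison rather than a repeated isTarget evaluation.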
>
> thanks,  Gerard.
>
>
