spark-dev mailing list archives

From Bin Wang <>
Subject Optimize the first map reduce of DStream
Date Tue, 24 Mar 2015 08:08:26 GMT

I'm learning Spark, and I think there is an opportunity to optimize the current
streaming implementation. Correct me if I'm wrong.

The current streaming implementation puts the data of one batch into memory
(as an RDD), but that does not seem necessary.

For example, if I want to count the lines that contain the word "Spark", I
just need to map every line to see whether it contains the word, then reduce
with a sum function. After that, the line is no longer useful and need not be
kept in memory. A sketch of such a job is below.
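For concreteness, here is a minimal sketch of that job against the ordinary
DStream API; the socket source, host, port, and batch interval are just
placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkLineCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SparkLineCount")
        // Placeholder batch interval of 10 seconds.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder source: lines of text from a socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Map each line to 1 if it contains "Spark", else 0,
        // then reduce with a sum. After a line is mapped, it is
        // never looked at again.
        val counts = lines.map(line => if (line.contains("Spark")) 1 else 0)
                          .reduce(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }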

That is to say, if the DStream has only one map and/or reduce operation on
it, it is not necessary to keep all the batch data in memory; something like
a pipeline should be enough.
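A rough sketch of the pipelined evaluation I have in mind, in plain Scala
rather than the Spark API, assuming the batch arrives as an Iterator[String]
instead of a materialized collection. Because map over an Iterator is lazy
and sum forces it, the two stages are fused: each line becomes garbage as
soon as it has been inspected, and only the running sum stays in memory.

    // Hypothetical helper, not part of Spark: count matching lines
    // in a single pass over a lazy iterator of the batch's lines.
    def countSparkLines(batch: Iterator[String]): Long =
      batch.map(line => if (line.contains("Spark")) 1L else 0L).sum

    // Example usage over a file read line by line:
    //   import scala.io.Source
    //   val n = countSparkLines(Source.fromFile("input.txt").getLines())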

Would this be difficult to implement on top of the current implementation?


Bin Wang
