Within a stage, serialization only occurs when you are using Python, and as far as I know only in the first stage, when the data is read and passed to the Python interpreter for the first time.

Multiple operations are just chains of simple map and flatMap operators at the task level, applied to a plain Scala Iterator[T], where T is the element type of the RDD.
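As a rough sketch of what that looks like inside a single task (plain Scala, no Spark needed; the iterator names are my own illustration, not Spark's actual internals):

  // Conceptually, each narrow transformation just wraps the parent's iterator.
  // Everything runs lazily on plain JVM objects; nothing is serialized between steps.
  val input: Iterator[String] = Iterator("1,2", "3,4", "5,6")

  // rdd.flatMap(_.split(",")) becomes:
  val afterFlatMap: Iterator[String] = input.flatMap(_.split(","))

  // .map(_.toInt) becomes:
  val afterMap: Iterator[Int] = afterFlatMap.map(_.toInt)

  // Elements only flow through the chain when the task consumes the final
  // iterator, e.g. to write shuffle output or return results.
  afterMap.foreach(println)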

On Thu, Aug 13, 2015 at 4:09 PM Hemant Bhanawat <hemant9379@gmail.com> wrote:
A chain of map and flatMap operations does not cause any serialization or deserialization.



On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann <mark.heimann@kard.info> wrote:
Hello everyone,

I am wondering what the effect of serialization is within a stage.

My understanding of Spark as an execution engine is that the data flow graph is divided into stages, and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or join), because such an operation can only be completed after the whole data set has been processed. At the end of a stage, shuffle files are written, and at the beginning of the next stage they are read back.
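If that understanding is right, then a toy job like the one below (my own sketch, using local mode) should consist of two stages, and the shuffle boundary should show up in the lineage that toDebugString prints:

  import org.apache.spark.{SparkConf, SparkContext}

  object StageBoundaryExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("stage-boundary").setMaster("local[*]"))

      val counts = sc.parallelize(Seq("a", "b", "a", "c"))
        .map(w => (w, 1))                         // pipelined into the first stage
        .groupByKey()                             // shuffle: starts a new stage
        .map { case (w, ones) => (w, ones.size) } // pipelined into the second stage

      // The indentation step in the printed lineage marks the shuffle boundary.
      println(counts.toDebugString)
      sc.stop()
    }
  }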

Within a stage, my understanding is that pipelining is used, so I wonder whether there is any serialization overhead when no shuffling takes place. I am also assuming that my data set fits into memory and does not need to be spilled to disk.

So if I chain multiple map or flatMap operations and they end up in the same stage, will there be any serialization overhead in piping the result of the first map operation into the following map operation?
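One way I thought of to check this empirically (Box is just a made-up class, and sc is an existing SparkContext): if elements were serialized and deserialized between the two maps, the second map would see a copy rather than the same JVM object, so the identity hashes would differ:

  case class Box(var value: Int)

  // Tag each object with its identity hash in the first map and compare in
  // the second. If the two maps are pipelined within one task, the second
  // map sees the very same JVM object and every comparison is true.
  val allSameObject = sc.parallelize(1 to 1000, 4)
    .map { i => val b = Box(i); (b, System.identityHashCode(b)) }
    .map { case (b, h) => System.identityHashCode(b) == h }
    .reduce(_ && _)

  println(allSameObject) // expected: true, i.e. no intermediate copy was made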

Any ideas and feedback appreciated, thanks a lot.

Best regards,
Mark