Hi,
Any input on this? I'm willing to instrument further and run more
experiments if anyone has ideas.
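
For context, the timing I mentioned below is just wall-clock time
measured around the serialization call. A minimal standalone sketch of
that kind of measurement (plain Java serialization, illustrative names
only, not the actual Spark code path I instrumented) would look
something like this:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Standalone sketch: time plain Java serialization of an object,
// analogous to timing the per-task serialization in the scheduler.
object SerTiming {
  def timeSerialize(obj: AnyRef): (Long, Int) = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    val start = System.nanoTime()
    oos.writeObject(obj)
    oos.close()
    val elapsedUs = (System.nanoTime() - start) / 1000
    (elapsedUs, bos.size())
  }

  def main(args: Array[String]): Unit = {
    // Byte array standing in for a ~1220-byte serialized task binary.
    val payload: Array[Byte] = Array.fill(1220)(0.toByte)
    val (micros, bytes) = timeSerialize(payload)
    println(s"serialized $bytes bytes in $micros us")
  }
}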
On Mon, May 4, 2015 at 11:27 AM, Akshat Aranya <aaranya@gmail.com> wrote:
> Hi,
>
> I have been investigating scheduling delays in Spark and found some
> unexplained anomalies. In my use case, I have two stages after
> collapsing the transformations: the first is a mapPartitions() and the
> second is a sortByKey(). I found that task serialization for the
> first stage takes much longer than for the second.
>
> 1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363
> ms). Each task serializes to 1220 bytes.
> 2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms). Each
> task serializes to 1139 bytes.
>
> Note that the serialized sizes of the tasks are similar, but the avg.
> scheduling time is very different. I also instrumented the code to
> print out the serialization time, and it seems like it is indeed the
> serialization that takes much longer. This seemed weird to me because
> the biggest part of the Task, the taskBinary, is actually copied
> directly from a byte array.
>
> Any explanation of why this happens?
>
> Thanks,
> Akshat