spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Boisvert <alex.boisv...@gmail.com>
Subject partitions, coalesce() and parallelism
Date Tue, 24 Jun 2014 19:50:58 GMT
With the following pseudo-code,

val rdd1 = sc.sequenceFile(...) // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files

I would expect the parallelism of the map() operation to be 100 concurrent
tasks, and the parallelism of the save() operation to be 2.

However, it appears the parallelism of the entire chain is 2 -- I only see
two tasks created for the save() operation and those tasks appear to
execute the map() operation as well.

Assuming what I'm seeing is as-specified (meaning, how things are meant to
be), what's the recommended way to force a parallelism of 100 on the map()
operation?

thanks!

Mime
View raw message