scala> rawLogs.partitions.size
res1: Int = 2171

On Tue, Jun 24, 2014 at 4:00 PM, Mayur Rustagi <> wrote:
To be clear number of map tasks are determined by number of partitions inside the rdd hence the suggestion by Nicholas. 

On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas <> wrote:

So do you get 2171 as the output for that command? That command tells you how many partitions your RDD has, so it’s good to first confirm that rdd1 has as many partitions as you think it has.

On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert <> wrote:
It's actually a set of 2171 S3 files, with an average size of about 18MB.

On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <> wrote:

What do you get for rdd1._jrdd.splits().size()? You might think you’re getting > 100 partitions, but it may not be happening.

On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <> wrote:
With the following pseudo-code,

val rdd1 = sc.sequenceFile(...) // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files

I would expect the parallelism of the map() operation to be 100 concurrent tasks, and the parallelism of the save() operation to be 2.

However, it appears the parallelism of the entire chain is 2 -- I only see two tasks created for the save() operation and those tasks appear to execute the map() operation as well.

Assuming what I'm seeing is as-specified (meaning, how things are meant to be), what's the recommended way to force a parallelism of 100 on the map() operation?