spark-dev mailing list archives

From Zhan Zhang <zzh...@hortonworks.com>
Subject RDD.map does not allow preservesPartitioning?
Date Thu, 26 Mar 2015 21:44:23 GMT
Hi Folks,

Does anybody know the reason for not allowing preservesPartitioning in RDD.map? Am I
missing something here?

The following example involves two shuffles. If preservesPartitioning were allowed, we could
avoid the second one, right?

 val r1 = sc.parallelize(List(1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 2, 4))
 val r2 = r1.map((_, 1))
 val r3 = r2.reduceByKey(_ + _)
 val r4 = r3.map(x => (x._1, x._2 + 1))
 val r5 = r4.reduceByKey(_ + _)
 r5.collect.foreach(println)

scala> r5.toDebugString
res2: String =
(8) ShuffledRDD[4] at reduceByKey at <console>:29 []
 +-(8) MapPartitionsRDD[3] at map at <console>:27 []
    |  ShuffledRDD[2] at reduceByKey at <console>:25 []
    +-(8) MapPartitionsRDD[1] at map at <console>:23 []
       |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
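
For comparison, a sketch of two documented RDD API alternatives that do carry the partitioner through this step (whether either fits the actual use case is a separate question):

```scala
// mapValues (on a pair RDD) leaves the keys untouched, so Spark can
// safely retain the existing partitioner from r3.
val r4a = r3.mapValues(_ + 1)
val r5a = r4a.reduceByKey(_ + _)  // r3's partitioning can be reused here

// mapPartitions exposes a preservesPartitioning flag; the caller asserts
// that the supplied function does not change the keys.
val r4b = r3.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)
val r5b = r4b.reduceByKey(_ + _)
```

The contrast also hints at why plain map lacks the flag: map may rewrite the keys arbitrarily, so Spark cannot guarantee the old partitioner still describes the data.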

Thanks.

Zhan Zhang

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

