Hello,
I am trying to do a groupBy on 5 attributes to get results in a form like a
pivot table in microsoft excel. The keys are the attribute tuples and
values are double arrays(maybe very large). Based on the code below, I am
getting back correct results, but would like to optimize it further(I
played around with numPartitions).
The two issues I see are 
1. flatMap is needed to expand the key tuples, but this also duplicates the
values, and as the values are large this increases the shuffle input size
for reduceByKey  is there a way to avoid the duplication?
2. reduceByKey is adding two arrays element wise, and creates a new array
for every addition, is there a way to reduce by not creating a new array
everytime(Similar to what accumulators do)?
I am pasting a sample code, query plan and output below.
Thanks.
val attributeToFloatArrayRDD = sc.parallelize(Array(
("A1", "B2", "C1", "D1", "E1") > (0.0 to 1000.0 by 0.25).toArray
, ("A2", "B1", "C1", "D2", "E1") > (5.0 to 1005.0 by 0.25).toArray
, ("A1", "B1", "C1", "D1", "E1") > (0.0 to 1000.0 by 0.25).toArray
, ("A3", "B3", "C1", "D1", "E2") > (0.0 to 1000.0 by 0.25).toArray
, ("A1", "B1", "C1", "D1", "E1") > (0.0 to 1000.0 by 0.25).toArray
, ("A4", "B3", "C1", "D1", "E1") > (8.0 to 1008.0 by 0.25).toArray
, ("A1", "B1", "C1", "D1", "E1") > (0.0 to 1000.0 by 0.25).toArray
))
val groupToVaRRDD = attributeToFloatArrayRDD
.flatMap(x => x._1 match {
case (t1, t2, t3, t4, t5) => Array((t1+"_top"), (t1, t2), (t1,
t2, t3), (t1, t2, t3, t4), (t1, t2, t3, t4, t5)).map(y => (y, x._2))
})
.reduceByKey((x, y) => {
require(x.size == y.size)
(x,y).zipped.map(_ + _)
})
.map(x => {
(x._1, x._2.sorted.take(x._2.size/20).last)
})
==== Query Plan
(16) MappedRDD[12] at map at GroupByTest.scala:81 []
 ShuffledRDD[11] at reduceByKey at GroupByTest.scala:76 []
+(16) FlatMappedRDD[10] at flatMap at GroupByTest.scala:68 []
 ParallelCollectionRDD[9] at parallelize at GroupByTest.scala:56 []
==== Output
GroupBy VaR
(A2,B1) 54.75
(A2,B1,C1,D2) 54.75
(A1,B1) 149.25
(A1,B1,C1,D1,E1) 149.25
(A3,B3,C1) 49.75
(A3,B3) 49.75
(A4,B3,C1,D1,E1) 57.75
(A2,B1,C1) 54.75
(A1,B2,C1,D1,E1) 49.75
(A1,B1,C1,D1) 149.25
(A3,B3,C1,D1,E2) 49.75
(A1,B2,C1) 49.75
(A3,B3,C1,D1) 49.75
(A4,B3) 57.75
(A1,B1,C1) 149.25
A1_top 199.0
(A4,B3,C1,D1) 57.75
A2_top 54.75
(A1,B2) 49.75
(A4,B3,C1) 57.75
A3_top 49.75
A4_top 57.75
(A2,B1,C1,D2,E1) 54.75
(A1,B2,C1,D1) 49.75
