spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benyi Wang <>
Subject Best practice for join
Date Tue, 04 Nov 2014 20:23:46 GMT
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,

# build (K,V) from A and B to prepare the join

val ja = r => (K1, Va))
val jb = r => (K1, Vb))

# join A, B

val jab = ja.join(jb)

# build (K,V) from the joined result of A and B to prepare joining with C

val jc = => (K2, Vc))
jab.join(jc).map( => (K,V) ).reduceByKey(_ + _)

Because A may have multiple fields, so Va is a tuple with more than 2
fields. It is said that scala Tuple may not be specialized, and there is
boxing/unboxing issue, so I tried to use "case class" for Va, Vb, and Vc,
K2 and K which are compound keys, and V is a pair of count and ratio, _+_
will create a new ratio. I register those case classes in Kryo.

The sizes of Shuffle read/write look smaller. But I found GC overhead is
really high: GC Time is about 20~30% of duration for the reduceByKey task.
I think a lot of new objects are created using case classes during

How to make the thing better?

View raw message