spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Best practice for join
Date Wed, 05 Nov 2014 06:54:29 GMT
How about using Spark SQL <https://spark.apache.org/sql/>?
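
Something like this rough sketch, written against the Spark 1.1 SchemaRDD
API; the RecA/RecB/RecC case classes, their fields, and the query itself
are invented here, since the real schemas aren't in the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row types standing in for A, B, and C.
case class RecA(k1: Long, countA: Long)
case class RecB(k1: Long, k2: Long)
case class RecC(k2: Long, ratio: Double)

val sc = new SparkContext(new SparkConf().setAppName("three-way-join"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

sc.parallelize(Seq(RecA(1L, 10L), RecA(2L, 5L))).registerTempTable("a")
sc.parallelize(Seq(RecB(1L, 7L), RecB(2L, 7L))).registerTempTable("b")
sc.parallelize(Seq(RecC(7L, 0.5))).registerTempTable("c")

// One query does the three-way join plus the aggregation, so Catalyst
// plans the shuffles instead of hand-built (K, V) pairs and reduceByKey.
val result = sqlContext.sql("""
  SELECT a.k1, c.k2, SUM(a.countA) AS total, SUM(c.ratio) AS ratio
  FROM a JOIN b ON a.k1 = b.k1 JOIN c ON b.k2 = c.k2
  GROUP BY a.k1, c.k2
""")
result.collect().foreach(println)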

Thanks
Best Regards

On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang <bewang.tech@gmail.com> wrote:

> I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
>
> # build (K,V) from A and B to prepare the join
>
> val ja = A.map( r => (K1, Va))
> val jb = B.map( r => (K1, Vb))
>
> # join A, B
>
> val jab = ja.join(jb)
>
> # build (K,V) from the joined result of A and B to prepare joining with C
>
> val jc = C.map(r => (K2, Vc))
> jab.join(jc).map( r => (K, V) ).reduceByKey(_ + _)
>
> Because A may have multiple fields, Va is a tuple with more than two
> fields. Scala tuples are said not to be specialized, so there is a
> boxing/unboxing issue; I therefore tried "case class" for Va, Vb, and Vc,
> and for the compound keys K2 and K. V is a pair of count and ratio, and
> _ + _ creates a new ratio. I registered those case classes with Kryo.
>
> The shuffle read/write sizes look smaller, but the GC overhead is really
> high: GC time is about 20-30% of the duration of the reduceByKey tasks.
> I think a lot of new objects are created from the case classes during
> map/reduce.
>
> How can I make this better?
>
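
For reference, the Kryo setup described above would look roughly like this
on Spark 1.1; the case classes below are placeholders, since the real
fields of Va, Vb, Vc, K2, and K aren't shown:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Placeholder compound keys and values.
case class K2(id1: Long, id2: Long)
case class K(id1: Long, id2: Long, id3: Long)
case class Va(f1: Long, f2: Long, f3: Double)
case class Vb(f1: Long)
case class Vc(count: Long, ratio: Double)

// Registration lets Kryo write a small numeric ID per record instead of
// the full class name.
class JoinRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[K2])
    kryo.register(classOf[K])
    kryo.register(classOf[Va])
    kryo.register(classOf[Vb])
    kryo.register(classOf[Vc])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "JoinRegistrator") // fully qualified name if packaged

Note that Kryo only shrinks the serialized shuffle data; it does not reduce
the per-record object allocation during map/reduce, which is what drives
the GC time.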
