spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pala M Muthaia <mchett...@rocketfuelinc.com>
Subject Broadcast joins on RDD
Date Mon, 12 Jan 2015 23:15:35 GMT
Hi,


How do i do broadcast/map join on RDDs? I have a large RDD that i want to
inner join with a small RDD. Instead of having the large RDD repartitioned
and shuffled for join, i would rather send a copy of a small RDD to each
task, and then perform the join locally.

How would i specify this in Spark code? I didn't find much documentation
online. I attempted to create a broadcast variable out of the small RDD and
then access that in the join operator:

largeRdd.join(smallRddBroadCastVar.value)

but that didn't work as expected ( I found that all rows with same key were
on same task)

I am using Spark version 1.0.1


Thanks,
pala

Mime
View raw message