spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Any "Replicated" RDD in Spark?
Date Mon, 03 Nov 2014 21:14:52 GMT
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins: in your
map function, look up each record in the hash table you broadcast and emit the matches.
Spark SQL also does this by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold
setting (10 KB by default, which is really small, but you can bump it up with set
spark.sql.autoBroadcastJoinThreshold=1000000, for example).
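
For illustration, here is a minimal sketch of that broadcast-plus-mapPartitions pattern
(the object name and sample data are hypothetical, not from the original message):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

    // Hypothetical data: a small lookup table and a large keyed RDD.
    val small: Map[Int, String] = Map(1 -> "a", 2 -> "b")
    val large = sc.parallelize(Seq((1, 10.0), (2, 20.0), (3, 30.0)))

    // Ship the small side to every node once.
    val smallBc = sc.broadcast(small)

    // Map-side inner join: each partition probes the broadcast hash table
    // locally, so the large RDD is never shuffled.
    val joined = large.mapPartitions { iter =>
      val lookup = smallBc.value
      iter.flatMap { case (k, v) => lookup.get(k).map(s => (k, (v, s))) }
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}

On the Spark SQL side, the threshold can be raised at runtime with something like
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=1000000").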

Matei

> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.code@gmail.com> wrote:
> 
> Hi All,
> 
> I have spent the last two years on Hadoop but am new to Spark.
> I am planning to move one of my existing systems to Spark to get some enhanced features.
> 
> My question is:
> 
> If I want to do a map-side join (similar to the "Replicated" keyword in Pig), how
can I do it? Is there any way to declare an RDD as "replicated" (meaning it is distributed
to all nodes and each node holds a full copy)?
> 
> I know I can use an accumulator to get this behavior, but I am not sure what the best
practice is. And if I use an accumulator to broadcast the data set, can I then (after the
broadcast) convert it into an RDD and do the join?
> 
> Regards,
> 
> Shuai



