spark-user mailing list archives

From "Shuai Zheng"
Subject RE: Any "Replicated" RDD in Spark?
Date Wed, 05 Nov 2014 20:32:15 GMT

I have a follow-up question. Suppose I have a file (or a set of files: part-0,
part-1, and so on; CSV, from a few hundred MB up to 1-2 GB, created by another
program) and need to build a hash table from it, then broadcast it to each node
so it can be queried (map-side join). I have two options:

1. I can load the file in plain driver code (open an input stream, etc.),
parse the content, and then create the broadcast from it.
2. I can also create an RDD from these files in the standard way, run a
map to parse it, collect it as a map, and wrap the result in a broadcast to
push it to all nodes again.
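
For concreteness, the two options could be sketched roughly as follows in
PySpark. This is only an illustration and has not been run on a cluster: the
two-column `key,value` CSV layout, the helper names (`parse_line`,
`load_on_driver`, `load_with_rdd`) and any paths passed in are assumptions,
and option 1 assumes the part files are readable from the driver's local
file system.

```python
import glob


def parse_line(line):
    """Split one CSV line into (key, value). Assumes a 'key,value' layout."""
    key, value = line.split(",", 1)
    return key, value


def load_on_driver(pattern):
    """Option 1: read and parse every matching part file on the driver."""
    table = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\r\n")
                if line:  # skip blank lines
                    key, value = parse_line(line)
                    table[key] = value
    return table


def load_with_rdd(sc, path):
    """Option 2: parse in parallel on the cluster, then collect to the driver.

    `sc` is an existing SparkContext (assumed to be created elsewhere).
    """
    return (
        sc.textFile(path)
        .filter(lambda line: line.strip())  # skip blank lines
        .map(parse_line)
        .collectAsMap()
    )
```

Either function returns a plain dict on the driver, so in both cases the
final step is the same: `table_bc = sc.broadcast(table)`. The difference is
only where the parsing work happens before the collect.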

I think option 2 might be more consistent with Spark's concepts (and less
code?). But how about performance? The gain is that the data can be loaded and
parsed in parallel; the penalty is that after loading we need to collect the
result and broadcast it again. Please share your opinion. I am not sure what is
the best practice here (in theory, either way works, but in the real world, which one is



-----Original Message-----
From: Matei Zaharia
Sent: Monday, November 03, 2014 4:15 PM
To: Shuai Zheng
Subject: Re: Any "Replicated" RDD in Spark?

You need to use broadcast followed by flatMap or mapPartitions to do
map-side joins (in your map function, you can look at the hash table you
broadcast and see what records match it). Spark SQL also does it by default
for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
default 10 KB, which is really small, but you can bump this up with set
spark.sql.autoBroadcastJoinThreshold=1000000 for example).
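
A minimal PySpark sketch of the broadcast-plus-mapPartitions pattern
described above. The function names and the (key, value) record shape are
assumptions made for illustration, and `sc` stands for an existing
SparkContext; treat it as a starting point rather than a definitive
implementation.

```python
def join_partition(records, lookup):
    """Yield (key, value, match) for each record whose key is in `lookup`.

    `records` is an iterator of (key, value) pairs from one partition;
    `lookup` is the small table as a plain dict.
    """
    for key, value in records:
        match = lookup.get(key)
        if match is not None:  # inner-join semantics: drop unmatched keys
            yield key, value, match


def map_side_join(sc, large_rdd, small_table):
    """Broadcast `small_table` once and join it against `large_rdd`
    inside each partition, so the large data set is never shuffled."""
    small_bc = sc.broadcast(small_table)
    return large_rdd.mapPartitions(
        lambda part: join_partition(part, small_bc.value)
    )
```

Because the lookup happens inside `mapPartitions`, each task reads the
broadcast value once per partition instead of once per record.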


> On Nov 3, 2014, at 1:03 PM, Shuai Zheng wrote:
> Hi All,
> I have spent the last two years on Hadoop but am new to Spark.
> I am planning to move one of my existing systems to Spark to get some
> enhanced features.
> My question is:
> If I try to do a map-side join (something similar to the "Replicated" keyword
> in Pig), how can I do it? Is there any way to declare an RDD as "replicated"
> (meaning it is distributed to all nodes and each node has a full copy)?
> I know I can use an accumulator to get this feature, but I am not sure what
> the best practice is. And if I use an accumulator to broadcast the data set,
> can I then (after the broadcast) convert it into an RDD and do the join?
> Regards,
> Shuai
