spark-user mailing list archives

From "Shuai Zheng" <>
Subject RE: Any "Replicated" RDD in Spark?
Date Thu, 06 Nov 2014 16:13:14 GMT

Thanks for the reply.

I'm not too worried about the extra code: I'm migrating from MapReduce, so
I already have code that handles it. But when I adopt a new technology, I
always prefer the "right" way over a temporary easy way. I will go with the
RDD approach first and test the performance.



-----Original Message-----
From: Matei Zaharia [] 
Sent: Wednesday, November 05, 2014 6:27 PM
To: Shuai Zheng
Subject: Re: Any "Replicated" RDD in Spark?

If you start with an RDD, you do have to collect to the driver and broadcast
to do this. Between the two options you listed, I think this one is simpler
to implement, and there won't be a huge difference in performance, so you
can go for it. Opening InputStreams to a distributed file system by hand can
be a lot of code.
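
For illustration, the collect-then-broadcast pattern described above can be
sketched in plain Python. This is a minimal sketch, not code from the thread:
build_lookup, join_partition, and the sample data are hypothetical names, and
the Spark wiring appears only in comments because it assumes a live
SparkContext `sc` and two pair RDDs (`small_rdd`, `big_rdd`) that are not
created here.

```python
import csv


def build_lookup(lines):
    """Parse CSV lines of the form key,value into a dict (the hash table)."""
    reader = csv.reader(lines)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def join_partition(records, lookup):
    """Inner-join one partition of (key, value) records against the lookup.

    This is the kind of function that would be passed to mapPartitions;
    `lookup` stands in for the value of a broadcast variable.
    """
    for key, value in records:
        if key in lookup:
            yield key, (value, lookup[key])


# Assumed wiring with a live SparkContext `sc` (hypothetical RDD names):
#   small = small_rdd.collectAsMap()   # collect the small side to the driver
#   bc = sc.broadcast(small)           # make it available on every executor
#   joined = big_rdd.mapPartitions(lambda part: join_partition(part, bc.value))

# Local stand-ins for the small data set and one partition of the large one.
small_lines = ["u1,Alice", "u2,Bob"]
partition = [("u1", "click"), ("u3", "view"), ("u2", "buy")]

lookup = build_lookup(small_lines)
result = list(join_partition(partition, lookup))
print(result)
```

Records whose key is missing from the lookup are dropped, which gives
inner-join behavior; yielding a default value instead would give a left
outer join.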


> On Nov 5, 2014, at 12:37 PM, Shuai Zheng <> wrote:
> And another similar case:
> If I already have an RDD from a previous step, but the next step should
> be a map-side join (so I need to broadcast this RDD to every node), what
> is the best way to do that? Collect the RDD in the driver first and
> create a broadcast from it? Or is there a shortcut in Spark for this?
> Thanks!
> -----Original Message-----
> From: Shuai Zheng []
> Sent: Wednesday, November 05, 2014 3:32 PM
> To: 'Matei Zaharia'
> Cc: ''
> Subject: RE: Any "Replicated" RDD in Spark?
> Nice.
> Then I have another question. I have a file (or a set of files:
> part-0, part-1; CSV, anywhere from a few hundred MB to 1-2 GB, created
> by another program) and need to build a hash table from it, then
> broadcast it to each node so it can be queried (map-side join). I have
> two options:
> 1. Load the file in ordinary code (open an InputStream, etc.), parse
> the content, and then create the broadcast from it.
> 2. Create an RDD from these files in the standard way, run a map to
> parse it, collect the result as a map, and wrap that in a broadcast to
> push it out to all nodes again.
> I think option 2 is more consistent with Spark's concepts (and less
> code?). But what about performance? The gain is that the data can be
> loaded and parsed in parallel; the penalty is that after loading we need
> to collect and broadcast the result again. Please share your opinion. I
> am not sure what the best practice is here (in theory either way works,
> but in the real world, which one is better?).
> Regards,
> Shuai
> -----Original Message-----
> From: Matei Zaharia []
> Sent: Monday, November 03, 2014 4:15 PM
> To: Shuai Zheng
> Cc:
> Subject: Re: Any "Replicated" RDD in Spark?
> You need to use broadcast followed by flatMap or mapPartitions to do 
> map-side joins (in your map function, you can look at the hash table 
> you broadcast and see what records match it). Spark SQL also does it 
> by default for tables smaller than the 
> spark.sql.autoBroadcastJoinThreshold setting (by default 10 KB, which 
> is really small, but you can bump this up with set
> spark.sql.autoBroadcastJoinThreshold=1000000 for example).
> Matei
>> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <> wrote:
>> Hi All,
>> I have spent the last two years on Hadoop but am new to Spark.
>> I am planning to move one of my existing systems to Spark to get some
>> enhanced features.
>> My question is:
>> If I want to do a map-side join (something similar to the "Replicated"
>> keyword in Pig), how can I do it? Is there any way to declare an RDD as
>> "replicated" (meaning it is distributed to all nodes and each node has
>> a full copy)?
>> I know I can use an accumulator to get this feature, but I am not sure
>> what the best practice is. And if I use an accumulator to broadcast the
>> data set, can I then (after the broadcast) convert it into an RDD and do
>> the join?
>> Regards,
>> Shuai
