spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Any "Replicated" RDD in Spark?
Date Wed, 05 Nov 2014 23:27:00 GMT
If you start with an RDD, you do have to collect it to the driver and broadcast it to do
this. Between the two options you listed, I think option 2 is simpler to implement, and there
won't be a huge difference in performance, so you can go for it. Opening InputStreams to a
distributed file system by hand can be a lot of code.
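
Roughly, option 2 would look like this (an untested sketch; the path, the CSV layout,
and the assumption that keys are unique are all placeholders to adjust):

    // Parse each CSV line into a (key, value) pair, pull the whole
    // table back to the driver, then broadcast it to every node.
    val table = sc.textFile("hdfs:///data/part-*")
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))  // adjust the indices to your schema
      }
      .collectAsMap()
    val tableBc = sc.broadcast(table)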

Matei

> On Nov 5, 2014, at 12:37 PM, Shuai Zheng <szheng.code@gmail.com> wrote:
> 
> And another similar case:
> 
> If I get an RDD from a previous step, but the next step should be a map-side
> join (so I need to broadcast this RDD to every node), what is the best way
> for me to do that? Collect the RDD on the driver first and create a
> broadcast? Or is there any shortcut in Spark for this?
> 
> Thanks!
> 
> -----Original Message-----
> From: Shuai Zheng [mailto:szheng.code@gmail.com] 
> Sent: Wednesday, November 05, 2014 3:32 PM
> To: 'Matei Zaharia'
> Cc: 'user@spark.apache.org'
> Subject: RE: Any "Replicated" RDD in Spark?
> 
> Nice.
> 
> Then I have another question, if I have a file (or a set of files: part-0,
> part-1, might be a few hundreds MB csv to 1-2 GB, created by other program),
> need to create hashtable from it, later broadcast it to each node to allow
> query (map side join). I have two options to do it:
> 
> 1, I can just load the file in a general code (open a inputstream, etc),
> parse content and then create the broadcast from it. 
> 2, I also can use a standard way to create the RDD from these file, run the
> map to parse it, then collect it as map, wrap the result as broadcast to
> push to all nodes again.
> 
> I think option 2 might be more consistent with Spark's concepts (and less
> code?), but how about performance? The gain is that loading and parsing the
> data happen in parallel; the penalty is that after loading we have to collect
> and broadcast the result again. Please share your opinion; I am not sure what
> the best practice is here (in theory either way works, but in the real world,
> which one is better?).
> 
> Regards,
> 
> Shuai
> 
> -----Original Message-----
> From: Matei Zaharia [mailto:matei.zaharia@gmail.com] 
> Sent: Monday, November 03, 2014 4:15 PM
> To: Shuai Zheng
> Cc: user@spark.apache.org
> Subject: Re: Any "Replicated" RDD in Spark?
> 
> You need to use broadcast followed by flatMap or mapPartitions to do
> map-side joins (in your map function, you can look at the hash table you
> broadcast and see which records match it). Spark SQL also does this by
> default for tables smaller than the spark.sql.autoBroadcastJoinThreshold
> setting (10 KB by default, which is really small, but you can bump it up
> with, for example, SET spark.sql.autoBroadcastJoinThreshold=1000000).
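> 
> Concretely, something like this (an untested sketch; smallTable and bigRdd
> are placeholder names for your hash table and your large pair RDD):
> 
>    val tableBc = sc.broadcast(smallTable)   // smallTable: a Map[K, V]
>    val joined = bigRdd.mapPartitions { iter =>
>      val table = tableBc.value              // read the broadcast once per partition
>      iter.flatMap { case (k, v) =>
>        table.get(k).map(t => (k, (v, t)))   // keep only the matching records
>      }
>    }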
> 
> Matei
> 
>> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.code@gmail.com> wrote:
>> 
>> Hi All,
>> 
>> I have spent the last two years on Hadoop but am new to Spark.
>> I am planning to move one of my existing systems to Spark to get some
>> enhanced features.
>> 
>> My question is:
>> 
>> If I try to do a map-side join (something similar to the "replicated"
>> keyword in Pig), how can I do it? Is there any way to declare an RDD as
>> "replicated" (meaning it is distributed to all nodes and each node has a
>> full copy)?
>> 
>> I know I can use an accumulator to get this feature, but I am not sure what
>> the best practice is. And if I use an accumulator to broadcast the data set,
>> can I then (after the broadcast) convert it into an RDD and do the join?
>> 
>> Regards,
>> 
>> Shuai
> 

