spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?
Date Thu, 01 May 2014 11:51:19 GMT
A broadcast variable is meant to be shared per node, not per map task.
The approach you are using should work; however, a 6GB broadcast
variable could be a problem. Does the broadcast eventually complete, or
does it always stay stuck?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, May 1, 2014 at 10:07 AM, PengWeiPRC <peng.wei.prc@gmx.com> wrote:

> Hi there,
>
> I was wondering if somebody could give me some suggestions about how to
> handle this situation:
>
>   I have a Spark program that first reads a 6GB file locally (not as an
> RDD) and then runs the map/reduce tasks. This 6GB file contains
> information that is shared by all the map tasks. Previously, I handled
> this with Spark's broadcast function, roughly like this:
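>     // sc is the SparkContext; fileRead and MapFunc are my own helpers
>     val global_file = sc.broadcast(fileRead("filename"))
>     // MapFunc reads the shared data through global_file.value
>     rdd.map(ele => MapFunc(ele))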
>
>   However, when I ran the Spark program on a cluster of multiple
> computers, I found that the remote nodes waited forever for global_file
> to be broadcast. I also do not think it is a good solution to have each
> map task load the global file by itself, since that would incur a huge
> overhead.
>
>   Actually, a copy of this global file already exists on every node of our
> cluster. Ideally, each node would read the file once from its own local
> disk, keep it in memory, and share it among all the map/reduce tasks
> scheduled on that node. So the global file is neither like a broadcast
> variable, which is shared by all map/reduce tasks, nor like a private
> variable seen by only one map task. It is shared node-wide: read once on
> each node and used by all the tasks assigned to that node.
>
> Could anybody tell me how to program in Spark to handle it? Thanks so much.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-handle-this-situation-Huge-File-Shared-by-All-maps-and-Each-Computer-Has-one-copy-tp5139.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
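
One way to approximate the node-wide sharing described in the quoted
message is a lazily initialized singleton. A Scala object is created once
per JVM, so its lazy val is loaded by the first task that touches it on an
executor and reused by every later task there. The sketch below is only an
illustration under assumptions: NodeLocalData, the global.file.path
property, and its default path are hypothetical names, and the file is
assumed to be line-oriented text available at the same path on every node.
The data is loaded once per executor JVM, so a node running several
executors would load it once per executor.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.io.Source

// Hypothetical holder object: one instance per JVM, hence one per executor.
object NodeLocalData {
  // Counts how many times the file was actually read (for illustration only).
  val loadCount = new AtomicInteger(0)

  // Assumed location of the per-node copy; the property name is made up.
  def path: String =
    sys.props.getOrElse("global.file.path", "/path/to/global_file")

  // Evaluated on first access only; Scala lazy vals are initialized
  // thread-safely, so concurrent tasks in one JVM share a single load.
  lazy val lines: Vector[String] = {
    loadCount.incrementAndGet()
    val src = Source.fromFile(path)
    try src.getLines().toVector
    finally src.close()
  }
}

// Possible use inside a Spark job (left as a comment because it needs Spark
// on the classpath; a two-argument MapFunc is hypothetical):
//
//   rdd.mapPartitions { iter =>
//     val shared = NodeLocalData.lines  // first task on an executor loads it
//     iter.map(ele => MapFunc(ele, shared))
//   }
```

Because NodeLocalData is referenced by name and not captured as a value,
nothing large is serialized with the task closure; each executor reads its
own local copy of the file instead of waiting for a 6GB broadcast.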
