I was wondering if somebody could give me some suggestions about how to
handle this situation:
I have a Spark program that first reads a 6GB file locally (not as an RDD)
and then runs the map/reduce tasks. This 6GB file contains information that
is shared by all the map tasks. Previously, I handled it using the
broadcast function in Spark, like this:
global_file = fileRead("filename")
rdd.map(ele => MapFunc(ele))
However, when running the Spark program on a cluster of multiple computers,
I found that the remote nodes waited forever for the broadcast of
global_file. I also think it would not be a good solution to have each map
task load the global file by itself, which would incur huge overhead.
Actually, we have this global file on each node of our cluster. The ideal
behavior would be that each node reads the global file only once, from its
own local disk, keeps it in memory, and shares it with all the map/reduce
tasks scheduled on that node. Hence, the global file is neither like a
broadcast variable, which is shared by all map/reduce tasks, nor like a
private variable seen by only one map task. It is shared node-wide: read
once on each node and shared by all the tasks assigned to that node.
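To make the behavior I am after concrete, here is a minimal sketch in plain
Scala (no Spark dependency). It assumes that "one copy per node" can be
approximated by "one copy per JVM", using a singleton object with a lazy val.
The names GlobalData, table, loadCount and Demo are hypothetical, and
fileRead is only a stand-in that builds a tiny map instead of reading the
real 6GB file.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical holder: a Scala object exists once per JVM.
object GlobalData {
  // Counts how often the load really runs (for illustration only).
  val loadCount = new AtomicInteger(0)

  // Stand-in for the real fileRead("filename"); builds a small map instead.
  private def fileRead(path: String): Map[Int, String] = {
    loadCount.incrementAndGet()
    (1 to 5).map(i => i -> s"value-$i").toMap
  }

  // A lazy val is initialized at most once, thread-safely, on first access.
  lazy val table: Map[Int, String] = fileRead("filename")
}

object Demo {
  // Stand-in for MapFunc: looks the element up in the shared table.
  def mapFunc(ele: Int): String = GlobalData.table.getOrElse(ele, "missing")

  // Simulates four tasks running concurrently inside the same JVM.
  def run(): Seq[String] = {
    val out = new Array[String](4)
    val threads = (1 to 4).map { t =>
      new Thread(new Runnable {
        def run(): Unit = { out(t - 1) = mapFunc(t) }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join()) // join makes each thread's write visible here
    out.toSeq
  }
}
```

In Spark, I imagine the same object could be referenced from inside the
function passed to rdd.map or rdd.mapPartitions, so that the file is loaded
on first use in each executor JVM rather than shipped from the driver. Note
that this would give one copy per executor process, not strictly per
machine, so a node running several executors would load the file several
times, and each executor would need enough heap to hold the data.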
Could anybody tell me how to implement this in Spark? Thanks so much.