Example:

val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
val bc = sc.broadcast(pageMap)
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

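Here you look up each visit's page key in the broadcast map built from the pages.txt mapping file and translate it to the page name. The map is collected on the driver and broadcast once, so every task reads its local copy.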
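For reference, a minimal self-contained sketch of the same pattern that can be run in local mode. The thread does not give the line format of pages.txt or visits.txt, so the comma-separated sample lines, the parsePair helper, the local[2] master and the "unknown" fallback are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Hypothetical parser: assumes each line is "key,value".
  // Throws a MatchError on lines without a comma.
  def parsePair(line: String): (String, String) = {
    val Array(k, v) = line.split(",", 2)
    (k, v)
  }

  def main(args: Array[String]): Unit = {
    // Master and app name are placeholders for a local test run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-lookup-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for sc.textFile("pages.txt"): (pageId, pageName)
      val pageNames = sc.parallelize(Seq("p1,Home", "p2,About")).map(parsePair)
      // Collect the small table on the driver and broadcast it once.
      val pageMap = pageNames.collect().toMap
      val bc = sc.broadcast(pageMap)

      // Stand-in for sc.textFile("visits.txt"): (pageId, visitor)
      val visits = sc.parallelize(Seq("p1,alice", "p2,bob", "p3,carol")).map(parsePair)

      // Map-side lookup; getOrElse avoids a NoSuchElementException
      // when a visit refers to a page missing from the map.
      val joined = visits.map { case (id, visitor) =>
        (id, (bc.value.getOrElse(id, "unknown"), visitor))
      }
      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Using getOrElse instead of bc.value(id) is a design choice for this sketch: with the direct apply, a single visit whose key is missing from the mapping fails the whole task.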



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



On Fri, May 2, 2014 at 4:16 AM, PengWeiPRC <peng.wei.prc@gmx.com> wrote:
Thanks, Rustagi. Yes, the global data is read-only and stays unchanged from the
beginning to the end of the whole Spark task. It is not used by just one
map/reduce task; many of my map/reduce tasks use it. That is why I intend to
put a copy of the data on each node of my cluster, and I would like to know
whether a Spark map/reduce program can have all the nodes read it
simultaneously from their local disks, rather than having one node read it
and broadcast it to the others. Any suggestions?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-handle-this-situation-Huge-File-Shared-by-All-maps-and-Each-Computer-Has-one-copy-tp5139p5192.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.