spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?
Date Fri, 02 May 2014 12:50:36 GMT
Example ...

val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
val bc = sc.broadcast(pageMap)
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

Here each record in visits is translated by looking up its key (v._1) in the
broadcast map built from the pages.txt mapping file.
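
For reference, a fuller sketch of the same pattern that runs in local mode. The
sample records, the BroadcastLookup object, the resolve helper and the
"unknown" fallback are assumptions for illustration only; in a real job the two
RDDs would come from sc.textFile as above, parsed according to your file format.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  // Hypothetical helper: attach the page name to one (pageId, visitor) record.
  // Falls back to "unknown" when the id is missing from the mapping.
  def resolve(pages: Map[String, String],
              visit: (String, String)): (String, (String, String)) =
    (visit._1, (pages.getOrElse(visit._1, "unknown"), visit._2))

  def main(args: Array[String]): Unit = {
    // local[*] is assumed here so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("BroadcastLookup").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the parsed pages.txt: (pageId, pageName) pairs.
      val pageNames = sc.parallelize(Seq(("p1", "Home"), ("p2", "About")))

      // Collect the small mapping to the driver and broadcast it to the executors.
      val pageMap = pageNames.collect().toMap
      val bc = sc.broadcast(pageMap)

      // Stand-in for the parsed visits.txt: (pageId, visitor) pairs.
      val visits = sc.parallelize(Seq(("p1", "alice"), ("p2", "bob"), ("p3", "carol")))

      // Map-side lookup against the broadcast value; no shuffle join is needed.
      val joined = visits.map(v => resolve(bc.value, v))
      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This only works while the mapping fits in memory on the driver and on every
executor, since collect() pulls the whole map to the driver before it is
broadcast.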



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Fri, May 2, 2014 at 4:16 AM, PengWeiPRC <peng.wei.prc@gmx.com> wrote:

> Thanks, Rustagi. Yes, the global data is read-only and stays from the
> beginning to the end of the whole Spark task. Actually, it is not only
> identical for one Map/Reduce task, but used by a lot of map/reduce tasks of
> mine. That's why I intend to put the data into each node of my cluster, and
> hope to see if it is possible for a Spark Map/Reduce program to let all the
> nodes read it simultaneously from their local disks rather than read it by
> one node and broadcast to other nodes. Any suggestions for solving it?
