spark-user mailing list archives

From Mayur Rustagi <>
Subject Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?
Date Fri, 02 May 2014 12:50:36 GMT
Example ...

val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap   // toMap takes no argument list
val bc = sc.broadcast(pageMap)
val visits = sc.textFile("visits.txt").map(...)
val joined = => (v._1, (bc.value(v._1), v._2)))

Here you are looking up the page names for each record in visits and
translating them using the pages.txt mapping file, which is broadcast once
to every node.
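The same map-side join logic can be sketched with plain Scala collections (the file contents here are made up, since pages.txt and visits.txt are placeholders in the example above; in Spark the lookup table would be built with collect().toMap and shipped to executors via sc.broadcast):

```scala
object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the broadcast lookup table built from pages.txt:
    // pageId -> pageName (hypothetical values)
    val pageMap: Map[Int, String] = Map(1 -> "home", 2 -> "about")

    // Stand-in for the visits RDD: (pageId, visitorIp)
    val visits: Seq[(Int, String)] =
      Seq((1, ""), (2, ""), (1, ""))

    // Map-side join: each visit record is translated locally against the
    // (broadcast) pageMap, with no shuffle of the visits data
    val joined = { v => (v._1, (pageMap(v._1), v._2)) }

    joined.foreach(println)
  }
}
```

Because every worker holds its own read-only copy of the lookup table, no shuffle is needed; this is why broadcast joins work well when one side fits in memory.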

Mayur Rustagi
Ph: +1 (760) 203 3257
@mayur_rustagi <>

On Fri, May 2, 2014 at 4:16 AM, PengWeiPRC <> wrote:

> Thanks, Rustagi. Yes, the global data is read-only and persists from the
> beginning to the end of the whole Spark task. In fact, it is not just
> identical within one Map/Reduce task; it is shared by many of my
> map/reduce tasks. That's why I intend to put a copy of the data on each
> node of my cluster, and I would like to know whether a Spark Map/Reduce
> program can have all the nodes read it simultaneously from their local
> disks, rather than have one node read it and broadcast it to the others.
> Any suggestions for solving this?
