spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: What if there are large, read-only variables shared by all map functions?
Date Thu, 24 Jul 2014 03:39:05 GMT
Have a look at broadcast variables.
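
For reference, a minimal PySpark sketch of what a broadcast variable looks like in use. The function name lookup_with_broadcast and the toy dictionary standing in for the look-up table are hypothetical and are not from this thread; sc is assumed to be an existing SparkContext. The sketch only shows the mechanism and says nothing about how long a multi-gigabyte broadcast takes.

```python
def lookup_with_broadcast(sc, keys, table):
    """Look up each key in a read-only table shared via a broadcast variable.

    sc    -- an existing SparkContext (assumed to be created by the caller)
    keys  -- iterable of keys to resolve
    table -- read-only dict standing in for the large look-up data
    """
    # Register the table as a broadcast variable so that it is cached on the
    # workers instead of being serialized into every task closure.
    shared = sc.broadcast(table)

    # Inside the map function, read the cached copy through .value.
    # Missing keys resolve to None because dict.get is used.
    return (
        sc.parallelize(keys)
        .map(lambda k: shared.value.get(k))
        .collect()
    )
```

With a live context, this could be called as lookup_with_broadcast(SparkContext.getOrCreate(), ["a", "b"], {"a": 1}), with SparkContext imported from pyspark.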


On Tuesday, July 22, 2014, Parthus <peng.wei.prc@gmail.com> wrote:

> Hi there,
>
> I was wondering if anybody could help me find an efficient way to make a
> MapReduce program like this:
>
> 1) Each map function needs access to some huge files, which are around
> 6GB.
>
> 2) These files are READ-ONLY. They are essentially huge look-up
> tables that will not change for 2~3 years.
>
> I tried two ways to make the program work, but neither of them is
> efficient:
>
> 1) The first approach I tried is to let each map function load those files
> independently, like this:
>
> map (...) { load(files); DoMapTask(...)}
>
> 2) The second approach I tried is to load the files before RDD.map(...) and
> broadcast them. However, because the files are so large, the
> broadcasting overhead is 30 minutes to 1 hour.
>
> Could anybody help me find an efficient way to solve it?
>
> Thanks very much.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-if-there-are-large-read-only-variables-shared-by-all-map-functions-tp10435.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


-- 
Sent from Gmail Mobile
