spark-user mailing list archives

From Parthus <>
Subject What if there are large, read-only variables shared by all map functions?
Date Tue, 22 Jul 2014 19:54:03 GMT
Hi there,

I was wondering if anybody could help me find an efficient way to write a
MapReduce program like this:

1) Each map function needs to access some huge files, which are around

2) These files are READ-ONLY. They are essentially a huge look-up
table that will not change for 2~3 years.

I tried two ways to make the program work, but neither of them is efficient:

1) The first approach I tried is to let each map function load those files
independently, like this:

map (...) { load(files); DoMapTask(...)}
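As a toy sketch of why this is slow (plain Python, with a small dict standing in for the huge files; `load_table` and `map_task` are made-up names for illustration): every single task pays the full load cost again.

```python
# Toy model of approach 1: the lookup table is reloaded inside every
# map task, so the load cost is paid once per task, not once per worker.
load_count = 0

def load_table():
    """Stand-in for loading the huge read-only files."""
    global load_count
    load_count += 1
    return {"key": "value"}

def map_task(record):
    table = load_table()          # reloaded for EVERY task -- the bottleneck
    return (record, table["key"])

results = [map_task(r) for r in range(100)]
# load_table ran 100 times: one full load per task
```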

2) The second approach I tried is to load the files up front and
broadcast them to the workers. However, because the files are so large, the
broadcast overhead is 30 min ~ 1 hour.
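A toy model of what broadcasting does (plain Python, using `pickle` as a stand-in for Spark's serialization; the `worker` function is hypothetical): the whole table must be serialized once and shipped to every worker, so the one-time cost grows with the table's size.

```python
import pickle

# Toy model of approach 2: the driver serializes the table once, each
# worker deserializes its own copy, and map tasks then read it locally.
table = {i: i * i for i in range(1000)}   # stand-in for the huge files
payload = pickle.dumps(table)             # one-time serialization cost,
                                          # proportional to the table size

def worker(payload, records):
    local = pickle.loads(payload)         # one deserialize per worker
    return [local[r] for r in records]

out = worker(payload, [1, 2, 3])
```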

Could anybody help me find an efficient way to solve it?

Thanks very much.
