spark-user mailing list archives

From Reynold Xin <r...@cs.berkeley.edu>
Subject Re: File broadcasting
Date Wed, 11 Sep 2013 13:46:39 GMT
Spark provides an abstraction called broadcast variables. It has multiple
underlying implementations and can be much more convenient than the Hadoop
distributed cache.

http://spark.incubator.apache.org/docs/0.7.3/scala-programming-guide.html#broadcast-variables
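For example, here is a minimal sketch against the 0.7 Scala API. The master
URL, the file path, and the lookupCountry helper are placeholders for
illustration; substitute the real MaxMind library call for the lookup.

import spark.SparkContext

// Placeholder for the real MaxMind lookup; replace with the library's
// reader over the database bytes.
def lookupCountry(db: Array[Byte], ip: String): String = "??"

val sc = new SparkContext("spark://master:7077", "GeoIPJob")

// Read the ~15 MB database once, on the driver.
val dbBytes = java.nio.file.Files.readAllBytes(
  java.nio.file.Paths.get("/path/to/GeoIP.dat"))

// Broadcast it: each worker fetches one copy and caches it locally,
// instead of shipping the bytes inside every task closure.
val dbBroadcast = sc.broadcast(dbBytes)

// Tasks read the worker-local copy through .value.
val countries = sc.textFile("hdfs:///logs/ips.txt")
  .map(ip => lookupCountry(dbBroadcast.value, ip))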


--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org



On Wed, Sep 11, 2013 at 7:11 PM, Konstantin Abakumov
<rusabakumov@gmail.com> wrote:

> Hello everyone!
>
> I am working on a task in which every cluster node executing a Spark job
> needs access to a large external file. The file is the MaxMind GeoIP
> database, and its size is around 15 megabytes. The library MaxMind
> provides reads it constantly with random access. Of course, it could
> simply be stored in HDFS, but random-access reads from there would be
> quite inefficient.
>
> Hadoop MapReduce has the DistributedCache module dedicated to this
> purpose. We can specify files in HDFS that will be required during job
> execution, and they are copied to the worker nodes before the job
> starts, so the job can efficiently access the local copies.
>
> I haven't found a simple and effective way of doing the same thing in
> Spark. Is there a preferred way to do it?
>
> --
> Best regards,
> Konstantin Abakumov
>
>
