spark-user mailing list archives

From Reynold Xin <>
Subject Re: File broadcasting
Date Wed, 11 Sep 2013 13:46:39 GMT
Spark provides an abstraction called broadcast variables. It has multiple
underlying implementations and can be much more convenient than Hadoop's
distributed cache.
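A minimal sketch of the idea, assuming the database file is readable from the driver's local disk (the path, the `localDb.length` placeholder, and the app name are illustrative, not from the original message):

```scala
import org.apache.spark.SparkContext
import java.nio.file.{Files, Paths}

object GeoIpBroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "geoip-broadcast-sketch")

    // Read the ~15 MB MaxMind database once on the driver
    // (path is hypothetical).
    val dbBytes: Array[Byte] =
      Files.readAllBytes(Paths.get("/path/to/GeoIP.dat"))

    // Broadcast ships the bytes to each executor exactly once,
    // rather than once per task.
    val dbBroadcast = sc.broadcast(dbBytes)

    val ips = sc.parallelize(Seq("8.8.8.8", "1.2.3.4"))
    val results = ips.map { ip =>
      // dbBroadcast.value is the executor-local copy; a MaxMind
      // reader built over these bytes would do the real lookup here.
      val localDb: Array[Byte] = dbBroadcast.value
      (ip, localDb.length) // placeholder for an actual GeoIP lookup
    }
    results.collect().foreach(println)
    sc.stop()
  }
}
```

Since the broadcast value lives in executor memory, the MaxMind reader gets fast random access to the database without any per-lookup trips to HDFS.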

Reynold Xin, AMPLab, UC Berkeley

On Wed, Sep 11, 2013 at 7:11 PM, Konstantin Abakumov wrote:
>  Hello everyone!
> I am solving a task in which every cluster node executing the Spark job
> needs access to a large external file. The file is the MaxMind GeoIP
> database, and its size is around 15 megabytes. MaxMind's provided library
> reads from it continually with random access. Of course, it could simply
> be stored in HDFS, but random-access reads against HDFS would be quite
> inefficient.
> Hadoop MapReduce has a DistributedCache module dedicated to this purpose:
> we can specify files in HDFS that will be required during job execution, and
> they are copied to the worker nodes before the job starts, so the job can
> efficiently access the local copies.
> I haven't found a simple and efficient way of doing the same thing in Spark.
> Is there a preferred way to do so?
> --
> Best regards,
> Konstantin Abakumov
