spark-user mailing list archives

From Konstantin Abakumov <>
Subject File broadcasting
Date Wed, 11 Sep 2013 11:11:38 GMT
Hello everyone!

I am working on a task in which every cluster node executing a Spark job needs access
to a large external file. The file is the MaxMind GeoIP database, around 15 megabytes in size.
MaxMind's library keeps it open and reads from it with random access. Of course, the file
could simply be stored in HDFS, but random-access reads from HDFS would be quite inefficient.

Hadoop MapReduce has a DistributedCache facility dedicated to this purpose: you specify files
in HDFS that will be required during job execution, and they are copied to the worker nodes before
the job starts, so the job can efficiently access local copies on each machine.

I couldn't find a simple and effective way of doing the same thing in Spark. Is there a preferred
way to do so?
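The closest thing I found so far is SparkContext.addFile together with SparkFiles.get, which seems analogous to DistributedCache, but I'm not sure it is the intended approach for this use case. A rough sketch of what I tried (the HDFS paths and the use of MaxMind's legacy LookupService API are just illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkFiles

object GeoIpLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "geoip-lookup")

    // Ship the database once; Spark copies it to every worker node.
    sc.addFile("hdfs:///data/GeoIP.dat")

    val ips = sc.textFile("hdfs:///data/access-ips.txt")
    val countries = ips.mapPartitions { part =>
      // SparkFiles.get resolves the local copy on this worker, so the
      // MaxMind reader gets fast random access to a local file.
      val localPath = SparkFiles.get("GeoIP.dat")
      val reader = new com.maxmind.geoip.LookupService(localPath)
      part.map(ip => reader.getCountry(ip).getName)
    }
    countries.saveAsTextFile("hdfs:///out/countries")

    sc.stop()
  }
}
```

This appears to work, but I don't know whether it is efficient for a file of this size or whether there is a more idiomatic mechanism.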

Best regards,
Konstantin Abakumov
