spark-user mailing list archives

From Konstantin Abakumov <>
Subject Re: File broadcasting
Date Fri, 13 Sep 2013 17:06:42 GMT
Thank you for the answer!

Broadcast variables are not quite what we need, because we have to distribute the
file and access it as an actual file on the cluster nodes - that is how the library works.
But the solution turned out to be simple: SparkContext has an addFile method, which I had
missed when asking the question - my fault.
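For reference, a minimal sketch of the addFile approach (the file path, master URL, and sample IPs are illustrative; on each worker, SparkFiles.get resolves the name to the local copy, so a library that needs a real on-disk file can open it directly):

```scala
import org.apache.spark.{SparkContext, SparkFiles}

object GeoIpExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "geoip-example")

    // Ship the file to every node; workers fetch it before tasks run.
    sc.addFile("/path/to/GeoIP.dat")

    val localPaths = sc.parallelize(Seq("1.2.3.4", "5.6.7.8"))
      .map { ip =>
        // Resolve the node-local copy of the shipped file.
        val path = SparkFiles.get("GeoIP.dat")
        (ip, path)
      }
      .collect()

    localPaths.foreach(println)
    sc.stop()
  }
}
```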

Best regards,
Konstantin Abakumov

On Wednesday, 11 September 2013 at 17:46, Reynold Xin wrote:

> Spark provides an abstraction called broadcast variables. It has multiple underlying
> implementations, and can be much more convenient than Hadoop's distributed cache.
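A minimal sketch of the broadcast-variable pattern Reynold suggests (the lookup table and IPs are invented for illustration; a broadcast value is shipped to each node once and read from every task, rather than re-sent per task):

```scala
import org.apache.spark.SparkContext

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "broadcast-example")

    // A read-only lookup table, broadcast once to each node.
    val countries = sc.broadcast(Map("1.2.3.4" -> "US", "5.6.7.8" -> "DE"))

    val resolved = sc.parallelize(Seq("1.2.3.4", "5.6.7.8"))
      .map(ip => (ip, countries.value.getOrElse(ip, "unknown")))
      .collect()

    resolved.foreach(println)
    sc.stop()
  }
}
```

This works well for data readable as an in-memory value; it does not directly help when, as in the GeoIP case, a library insists on opening a file by path.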
> --
> Reynold Xin, AMPLab, UC Berkeley
> On Wed, Sep 11, 2013 at 7:11 PM, Konstantin Abakumov <> wrote:
> > Hello everyone!
> >  
> > I am solving a task in which every cluster node executing the Spark job needs
> > access to a big external file. The file is the MaxMind GeoIP database and its size is around 15
> > megabytes. MaxMind's library reads from it continuously with random access.
> > Of course, it could just be stored in HDFS, but random-access reads from there would be quite slow.
> >  
> > Hadoop MapReduce has a DistributedCache mechanism dedicated to this purpose. We can
> > specify files in HDFS that will be required during job execution, and they are copied to worker
> > nodes before the job starts, so the job can efficiently access the local copies.
> >  
> > I couldn't find a simple and effective way to do the same thing in Spark. Is there
> > any preferred way to do so?
> >  
> > --  
> > Best regards,
> > Konstantin Abakumov
> >  
