Something like this:

object Model {
   @transient lazy val modelObject = new ModelLoader("model-filename")

   def get() = modelObject

object SparkJob {
  def main(args: Array[String]) = {

     sc.parallelize(…).map(test => {

On Thu, Sep 28, 2017 at 3:49 PM, Vadim Semenov <> wrote:
as an alternative
spark-submit --files <file you want>

the files will be put on each executor in the working directory, so you can then load it alongside your `map` function

Behind the scene it uses `SparkContext.addFile` method that you can use too✓#L1508-L1558

On Wed, Sep 27, 2017 at 10:08 PM, Naveen Swamy <> wrote:
Hello all,

I am a new user to Spark, please bear with me if this has been discussed earlier. 

I am trying to run batch inference using DL frameworks pre-trained models and Spark. Basically, I want to download a model(which is usually ~500 MB) onto the workers and load the model and run inference on images fetched from the source like S3 something like this
rdd = sc.parallelize(load_from_s3)

I was able to get it running in local mode on Jupyter, However, I would like to load the model only once and not every map operation. A setup hook would have nice which loads the model once into the JVM, I came across this JIRA  which suggests that I can use Singleton and static initialization. I tried to do this using a Singleton metaclass following the thread here Following this failed miserably complaining that Spark cannot serialize ctype objects with pointer references.

After a lot of trial and error, I moved the code to a separate file by creating a static method for predict that checks if a class variable is set or not and loads the model if not set. This approach does not sound thread safe to me, So I wanted to reach out and see if there are established patterns on how to achieve something like this.

Also, I would like to understand the executor->tasks->python process mapping, Does each task gets mapped to a separate python process?  The reason I ask is I want to be to use mapPartition method to load a batch of files and run inference on them separately for which I need to load the object once per task. Any 

Thanks for your time in answering my question.

Cheers, Naveen