spark-user mailing list archives

From Eike von Seggern <eike.segg...@sevenval.com>
Subject Re: Loading objects only once
Date Thu, 28 Sep 2017 07:34:39 GMT
Hello,

maybe broadcast can help you here. [1]

You can load the model once on the driver and then broadcast it to the
workers with `bc_model = sc.broadcast(model)`. Inside the map function you
can then access it via `bc_model.value` (an attribute, not a method call).

Best

Eike

[1]
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.broadcast

Naveen Swamy <mnnaveen@gmail.com> schrieb am Do., 28. Sep. 2017 um
04:09 Uhr:

> Hello all,
>
> I am a new user to Spark, please bear with me if this has been discussed
> earlier.
>
> I am trying to run batch inference using pre-trained models from DL
> frameworks together with Spark. Basically, I want to download a model
> (usually ~500 MB) onto the workers, load it, and run inference on images
> fetched from a source like S3, roughly like this:
> rdd = sc.parallelize(load_from_s3)
> rdd.map(fetch_from_s3).map(read_file).map(predict)
>
> I was able to get it running in local mode in Jupyter. However, I would
> like to load the model only once rather than in every map operation. A
> setup hook that loads the model once per worker would have been nice. I
> came across this JIRA https://issues.apache.org/jira/browse/SPARK-650,
> which suggests using a singleton with static initialization. I tried to
> do this with a Singleton metaclass, following the thread here
> https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python.
> That failed miserably, with Spark complaining that it cannot serialize
> ctypes objects with pointer references.
>
> After a lot of trial and error, I moved the code to a separate file and
> created a static predict method that checks whether a class variable is
> set and loads the model if it is not. This approach does not sound
> thread-safe to me, so I wanted to reach out and see if there are
> established patterns for achieving something like this.
>
>
> Also, I would like to understand the executor -> tasks -> Python process
> mapping: does each task get mapped to a separate Python process? The
> reason I ask is that I want to be able to use the mapPartitions method to
> load a batch of files and run inference on them separately, for which I
> need to load the object once per task. Any pointers would be appreciated.
>
>
> Thanks for your time in answering my question.
>
> Cheers, Naveen
>
>
>
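On the thread-safety concern above: PySpark tasks normally run in separate
worker processes rather than threads, but a lock-guarded, module-level lazy
loader is a safe per-process singleton either way. A sketch (the dict is a
hypothetical stand-in for the real model load; put this in its own module
that the workers import):

```python
import threading

_model = None
_lock = threading.Lock()

def get_model():
    # Double-checked locking: the model is loaded at most once per
    # Python process, even if called concurrently from several threads.
    global _model
    if _model is None:
        with _lock:
            if _model is None:
                _model = {"bias": 1}  # stand-in for the expensive load
    return _model
```

Every call after the first returns the same object, so calling
`get_model()` inside `predict` amortizes the load across all records a
worker process handles.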
