spark-dev mailing list archives

From Justin Uang <justin.u...@gmail.com>
Subject Re: Pickle Spark DataFrame
Date Tue, 03 Nov 2015 21:17:49 GMT
Is the Manager a Python multiprocessing manager? Why are you using
parallelism in Python when, in theory, most of the heavy lifting is done
by Spark?

On Wed, Oct 28, 2015 at 4:27 PM agg212 <agg@cs.brown.edu> wrote:

> I would just like to put a Spark DataFrame in a manager.dict() and get it
> back out (manager.dict() calls pickle on the object being stored).
> Ideally, I would store only a pointer to the DataFrame object so that it
> remains distributed within Spark (i.e., not materialized and then
> stored).  Here is an example:
>
> from multiprocessing import Manager
>
> # assumes an existing SQLContext named sqlContext
> data = sqlContext.read.json(data_file)  # load file as a DataFrame
> cache = Manager().dict()  # process-safe container
> cache['id'] = data  # goal: store a reference, not the materialized result
> new_data = cache['id']  # goal: get the distributed DataFrame back
> new_data.show()
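
One possible workaround for the snippet quoted above, as a rough sketch rather
than anything Spark provides: keep the DataFrame in an ordinary dict inside the
driver process and put only a picklable name into the managed dict.
FakeDataFrame and HandleRegistry are hypothetical names made up for this
illustration, and the lock is only a stand-in for the process-local handle
that presumably makes a real DataFrame fail to pickle.

```python
import pickle
import threading
from multiprocessing import Manager


class FakeDataFrame:
    """Hypothetical stand-in for a Spark DataFrame (not a Spark class)."""

    def __init__(self, name):
        self.name = name
        # A lock cannot be pickled; it plays the role of a live,
        # process-local handle held by the real object.
        self._handle = threading.Lock()


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # failures surface as TypeError or pickle.PicklingError
        return False


class HandleRegistry:
    """Hypothetical driver-local registry: objects stay here, names are shared."""

    def __init__(self):
        self._objects = {}

    def put(self, shared, key, obj):
        self._objects[obj.name] = obj  # the object never leaves this process
        shared[key] = obj.name         # only a plain string is stored

    def get(self, shared, key):
        return self._objects[shared[key]]


if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()
        registry = HandleRegistry()
        df = FakeDataFrame("events")
        print(is_picklable(df))
        registry.put(shared, "id", df)
        print(registry.get(shared, "id") is df)
```

The catch is that only the process that owns the registry can turn the name
back into the object. Other processes can read the name from the managed dict
but cannot reach the DataFrame through it.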
