spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <>
Subject Re: PySpark, numpy arrays and binary data
Date Thu, 07 Aug 2014 22:24:15 GMT
On Thu, Aug 7, 2014 at 12:06 AM, Rok Roskar <> wrote:
> sure, but if you knew that a numpy array went in on one end, you could safely use it
on the other end, no? Perhaps it would require an extension of the RDD class and overriding
the colect() method.
>> Could you give a short example about how numpy array is used in your project?
> sure -- basically our main data structure is a container class (acts like a dictionary)
that holds various arrays that represent particle data. Each particle has various properties,
position, velocity, mass etc. you get at these individual properties by calling something
> s['pos']
> where 's' is the container object and 'pos' is the name of the array. A really common
use case then is to select particles based on their properties and do some plotting, or take
a slice of the particles, e.g. you might do
> r = np.sqrt((s['pos']**2).sum(axis=1))
> ind = np.where(r < 5)
> plot(s[ind]['x'], s[ind]['y'])
> Internally, the various arrays are kept in a dictionary -- I'm hoping to write a class
that keeps them in an RDD instead. To the user, this would have to be transparent, i.e. if
the user wants to get at the data for specific particles, she would just have to do
> s['pos'][1,5,10]
> for example, and the data would be fetched for her from the RDD just like it would be
if she were simply using the usual single-machine version. This is why the writing to/from
files when retrieving data from the RDD really is a no-go -- can you recommend how this can
be circumvented?

RDD is expected as distributed, so accessing the items in RDD by key
or indices directly will not be easy. So I think you can not mapping
this interface to an RDD, or the result will be what user expected,
such as very very slow.

In order to parallelize the computation, most of them should be done
by transformation of RDDs. Finally, fetch the data from RDD by
collect(), then do the plotting stuff. Can this kind of work flow work
for you cases?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message