spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <dav...@databricks.com>
Subject Re: PySpark, numpy arrays and binary data
Date Wed, 06 Aug 2014 16:24:56 GMT
numpy array only can support basic types, so we can not use it during collect()
by default.

Could you give a short example about how numpy array is used in your project?

On Wed, Aug 6, 2014 at 8:41 AM, Rok Roskar <rokroskar@gmail.com> wrote:
> Hello,
>
> I'm interested in getting started with Spark to scale our scientific
> analysis package (http://pynbody.github.io) to larger data sets. The package
> is written in Python and makes heavy use of numpy/scipy and related
> frameworks. I've got a couple of questions that I have not been able to find
> easy answers to despite some research efforts... I hope someone here can
> clarify things for me a bit!
>
> * is there a preferred way to read binary data off a local disk directly
> into an RDD? Our I/O routines are built to read data in chunks and each
> chunk could be read by a different process/RDD, but it's not clear to me how
> to accomplish this with the existing API. Since the idea is to process data
> sets that don't fit into a single node's memory, reading first and then
> distributing via sc.parallelize is obviously not an option.

If you already know how to partition the data, then you could use
sc.parallelize()
to distribute the description of your data, then read the data in parallel by
given descriptions.

For examples, you can partition your data into (path, start, length), then

partitions = [(path1, start1, length), (path1, start2, length), ...]

def read_chunk(path, start, length):
      f = open(path)
      f.seek(start)
      data = f.read(length)
      #processing the data

rdd = sc.parallelize(partitions, len(partitions)).flatMap(read_chunk)

> * related to the first question -- when an RDD is created by parallelizing a
> numpy array, the array gets serialized and distributed. I see in the source
> that it actually gets written into a file first (!?) -- but surely the Py4J
> bottleneck for python array types (mentioned in the source comment) doesn't
> really apply to numpy arrays? Is it really necessary to dump the data onto
> disk first? Conversely, the collect() seems really slow and I suspect that
> this is due to the combination of disk I/O and python list creation. Are
> there any ways of getting around this if numpy arrays are being used?
>
>
> I'd be curious about any other best-practices tips anyone might have for
> running pyspark with numpy data...!
>
> Thanks!
>
>
> Rok
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message