spark-dev mailing list archives

From Josh Rosen <rosenvi...@gmail.com>
Subject Re: PySpark - Dill serialization
Date Fri, 06 Dec 2013 04:02:49 GMT
Thanks for the link!  I wasn't aware of Dill, but it looks like a nice
library.  I like that it's being actively developed:
https://github.com/uqfoundation/dill

It also seems to work correctly for a few edge-cases that cloudpickle
didn't handle properly, such as serializing operator.itemgetter instances
(see https://spark-project.atlassian.net/browse/SPARK-791).
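For reference, here's a quick sketch (assuming you have dill installed) of the
itemgetter case from SPARK-791 round-tripping through Dill:

import operator

import dill

# The case from SPARK-791: cloudpickle couldn't serialize this,
# but dill round-trips it without trouble.
getter = operator.itemgetter(1)
restored = dill.loads(dill.dumps(getter))
print(restored(("a", "b")))  # prints "b"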

I'll put together a pull request to replace CloudPickle with Dill.  Dill
uses a 3-clause BSD license, so we should be able to package it into an
.egg in the python/lib/ folder like we did for Py4J.  It will be
interesting to see whether the change has any performance impact, although
the recent custom serializers pull request should help with that since it
would let us use Dill for serializing functions and a faster serializer for
serializing data.
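To sketch the split I have in mind (this is just plain dill plus cPickle for
illustration, not PySpark's actual serializer hooks):

import cPickle  # Python 2's fast stock pickler
import dill

def add_one(x):
    return x + 1

# Dill handles the function/closure side of things...
func_bytes = dill.dumps(add_one)
restored_func = dill.loads(func_bytes)

# ...while the faster stock serializer handles the actual data records.
data_bytes = cPickle.dumps(range(5), protocol=2)
restored_data = cPickle.loads(data_bytes)

print([restored_func(x) for x in restored_data])  # [1, 2, 3, 4, 5]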

- Josh




On Thu, Dec 5, 2013 at 4:49 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:

> Hi devs
>
> I came across Dill (
> http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python
> serialization. Was wondering if it might be a replacement for the cloudpickle
> stuff (and remove that piece of code that needs to be maintained within
> PySpark)?
>
> Josh have you looked into Dill? Any thoughts?
>
> N
>
