spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: PySpark - Dill serialization
Date Fri, 06 Dec 2013 04:15:09 GMT
Looks cool! Josh, if you replace CloudPickle with this, make sure to also update the LICENSE
file, which is supposed to contain third-party licenses.

Matei

On Dec 5, 2013, at 8:02 PM, Josh Rosen <rosenville@gmail.com> wrote:

> Thanks for the link!  I wasn't aware of Dill, but it looks like a nice
> library.  I like that it's being actively developed:
> https://github.com/uqfoundation/dill
> 
> It also seems to work correctly for a few edge cases that cloudpickle
> didn't handle properly, such as serializing operator.itemgetter instances
> (see https://spark-project.atlassian.net/browse/SPARK-791).
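
A minimal sketch of the itemgetter case mentioned above, assuming dill is
installed (the sample data and variable names are just illustrative):

    import operator
    import dill

    # Plain pickle (and the old cloudpickle) cannot serialize itemgetter
    # instances; dill round-trips them (the SPARK-791 edge case).
    getter = operator.itemgetter(1)
    data = [("a", 1), ("b", 2), ("c", 3)]

    restored = dill.loads(dill.dumps(getter))
    assert [restored(x) for x in data] == [1, 2, 3]
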
> 
> I'll put together a pull request to replace CloudPickle with Dill.  Dill
> uses a 3-clause BSD license, so we should be able to package it into an
> .egg in the python/lib/ folder like we did for Py4J.  It will be
> interesting to see whether the change has any performance impact, although
> the recent custom serializers pull request should help with that since it
> would let us use Dill for serializing functions and a faster serializer for
> serializing data.
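
A rough sketch of the function-vs-data split described above, assuming dill is
installed; this is not PySpark's serializer API, just an illustration of using
dill for the closure and a faster serializer (here the stdlib pickle at its
highest protocol) for the records:

    import pickle
    import dill

    def make_adder(n):
        def add(x):
            return x + n
        return add

    # Nested functions/closures need dill (plain pickle cannot serialize
    # them), while simple records can go through the stdlib pickle.
    func_bytes = dill.dumps(make_adder(10))
    data_bytes = [pickle.dumps(x, pickle.HIGHEST_PROTOCOL) for x in range(5)]

    add10 = dill.loads(func_bytes)
    assert [add10(pickle.loads(b)) for b in data_bytes] == [10, 11, 12, 13, 14]
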
> 
> - Josh
> 
> 
> On Thu, Dec 5, 2013 at 4:49 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> 
>> Hi devs
>> 
>> I came across Dill (
>> http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python
>> serialization. Was wondering if it could be a replacement for the cloudpickle
>> code (and let us remove that piece of code that needs to be maintained within
>> PySpark)?
>> 
>> Josh, have you looked into Dill? Any thoughts?
>> 
>> N
>> 

