I don't have any strong views, so just to highlight possible issues:

On 02/14/2017 12:22 AM, Holden Karau wrote:
It's a good question. Py4J seems to have been updated 5 times in 2016, and each update is a bit involved (from a review point of view, verifying the zip file contents is somewhat tedious).
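
To make the zip review point concrete, here is a minimal sketch of how a reviewer might fingerprint a bundled archive and compare it with an upstream release. This is not something Spark ships; the function names `zip_fingerprint` and `diff_fingerprints` are hypothetical, and it assumes the reviewer has both archives available locally.

```python
import hashlib
import zipfile


def zip_fingerprint(zip_source):
    """Map each file name in the archive to the SHA-256 hex digest of its contents.

    `zip_source` may be a path or a binary file-like object.
    """
    digests = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            # Directory entries carry no content, so skip them.
            if info.filename.endswith("/"):
                continue
            digests[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def diff_fingerprints(bundled, upstream):
    """Return the sorted names that are missing from one side or differ in content."""
    names = set(bundled) | set(upstream)
    return sorted(name for name in names if bundled.get(name) != upstream.get(name))
```

An empty list from `diff_fingerprints` would mean the two archives contain the same files with the same bytes; any names it returns are the ones worth opening by hand.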

For cloudpickle it is a bit harder to tell, since we can have changes to cloudpickle which aren't correctly tagged as backports from the fork (and these can take a while to review, since we don't always recognize them as backports right away).

Another difficulty with looking at backports is that since our review process for PySpark has historically been on the slow side, changes benefiting systems like dask or IPython parallel were not backported to Spark unless they caused serious errors.

I think the key benefits are the better test coverage of the forked version of cloudpickle, a more standardized packaging of dependencies, and simpler dependency updates, which reduce the friction of benefiting from other related projects' work. Python serialization really isn't our secret sauce.

If I'm missing any substantial benefits or costs I'd love to know :)

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <rxin@databricks.com> wrote:
With any dependency update (or refactoring of existing code), I always ask this question: what's the benefit? In this case it looks like the benefit is reduced effort on backports. Do you know how often we have needed to do those?

On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <holden@pigscanfly.ca> wrote:
Hi PySpark Developers,

Cloudpickle is a core part of PySpark, and was originally copied from (and improved upon) picloud. Since then other projects have found cloudpickle useful, and a fork of cloudpickle was created and is now maintained as its own library (with better test coverage and resulting bug fixes, I understand). We've had a few PRs backporting fixes from the cloudpickle project into Spark's local copy of cloudpickle. How would people feel about moving to an explicit (pinned) dependency on cloudpickle?

We could add cloudpickle to the setup.py and a requirements.txt file for users who prefer not to do a system installation of PySpark.

Py4J is maybe an even simpler case: we currently have a zip of py4j in our repo but could instead require a pinned version. While we do depend on a lot of py4j internal APIs, version pinning should be sufficient to ensure functionality (and simplify the update process).
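
As a rough sketch of what the pinning could look like, the snippet below keeps the pins in one place and derives both the `install_requires` list for setup.py and the text of a requirements.txt from it. The version numbers and the helper names `install_requires` and `requirements_txt` are placeholders for illustration only; the actual pins would need to be chosen and tested.

```python
# Hypothetical pins for illustration; the real versions would need to be agreed on.
PINNED_DEPENDENCIES = {
    "py4j": "0.10.4",
    "cloudpickle": "0.2.2",
}


def install_requires(pins):
    """Build exact-version requirement strings suitable for setup(install_requires=...)."""
    return ["{0}=={1}".format(name, version) for name, version in sorted(pins.items())]


def requirements_txt(pins):
    """Render the same pins as the contents of a requirements.txt file."""
    return "\n".join(install_requires(pins)) + "\n"
```

Deriving both outputs from a single mapping would keep setup.py and requirements.txt from drifting apart when a pin is bumped.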


Holden :)



Maciej Szymkiewicz