spark-dev mailing list archives

From Maciej Szymkiewicz <>
Subject Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?
Date Tue, 14 Feb 2017 15:19:15 GMT
I don't have any strong views, so just to highlight possible issues:

  * Judging by various issues I've seen, a substantial number of users
    depend on system-wide Python installations. As far as I am aware,
    neither Py4J nor cloudpickle is available in the standard system
    repositories of Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its
    end of life, we have to be sure that any external dependency has
    the same policy.
  * Py4J is missing from the default Anaconda channel. Not a big issue,
    just a small annoyance.
  * External dependencies with pinned versions add some overhead to
    development across versions (effectively we may need a separate
    environment for each major Spark release). I've seen small
    inconsistencies in PySpark behavior across different Py4J versions,
    so this is not completely hypothetical.
  * Adding possible version conflicts. It is probably not a big risk,
    but something to consider (for example in combinations like Blaze +
    Dask + …).
  * Adding another party the user has to trust.
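To make the version-conflict bullet concrete, here is a minimal, purely illustrative sketch (the project names and version strings are hypothetical, not taken from the thread): two projects pinning different exact versions of a shared dependency cannot both be satisfied in one environment.

```python
# Illustrative sketch of the version-conflict concern above; the
# project names and version strings are made up for the example.
def pins_compatible(pins):
    """pins maps project name -> exact version it requires of one
    shared dependency; a single environment can satisfy them only
    if every project pins the same version."""
    return len(set(pins.values())) <= 1

print(pins_compatible({"pyspark": "0.10.4", "other-lib": "0.10.4"}))  # True
print(pins_compatible({"pyspark": "0.10.4", "other-lib": "0.10.6"}))  # False
```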

On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in
> 2016, and updating it is a bit involved (from a review point of view,
> verifying the zip file contents is somewhat tedious).
> cloudpickle is harder to gauge, since changes to our copy of
> cloudpickle aren't always correctly tagged as backports from the fork
> (and backports can take a while to review, since we don't always
> recognize them as backports right away).
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
> I think the key benefits are: better test coverage of the forked
> version of cloudpickle, more standardized packaging of dependencies,
> and simpler dependency updates, which reduce the friction in
> benefiting from related projects' work - Python serialization really
> isn't our secret sauce.
> If I'm missing any substantial benefits or costs I'd love to know :)
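As background on why PySpark carries cloudpickle at all: the standard-library pickle cannot serialize lambdas or interactively defined functions, which is exactly the gap cloudpickle fills. A small stdlib-only sketch of the limitation (the helper name is illustrative):

```python
import pickle

def picklable(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(picklable([1, 2, 3]))        # plain data pickles fine
print(picklable(lambda x: x + 1))  # lambdas do not - cloudpickle's job
```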
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <> wrote:
>     With any dependency update (or refactoring of existing code), I
>     always ask this question: what's the benefit? In this case it
>     looks like the benefit is reducing the effort spent on backports.
>     Do you know how often we needed to do those?
>     On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <> wrote:
>         Hi PySpark Developers,
>         cloudpickle is a core part of PySpark; it was originally
>         copied from (and improved upon) picloud. Since then, other
>         projects have found cloudpickle useful, and a fork of
>         cloudpickle <> was created and is now maintained as its own
>         library <> (with better test coverage and, as I understand
>         it, resulting bug fixes). We've had a few PRs backporting
>         fixes from the cloudpickle project into Spark's local copy of
>         cloudpickle - how would people feel about taking an explicit
>         (pinned) dependency on cloudpickle?
>         We could add cloudpickle to the and a
>         requirements.txt file for users who prefer not to do a system
>         installation of PySpark.
>         Py4J is maybe an even simpler case: we currently have a zip
>         of py4j in our repo, but we could instead require a pinned
>         version. While we do depend on a lot of py4j internal APIs,
>         version pinning should be sufficient to ensure functionality
>         (and it would simplify the update process).
>         Cheers,
>         Holden :)
>         -- 
>         Twitter:
>         <>
> -- 
> Cell : 425-233-8271
> Twitter:
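The pinning Holden describes for Py4J could be as simple as an exact-version check at startup. A hedged sketch of that idea (the pinned version string and function name are illustrative, not Spark's actual mechanism):

```python
# Hypothetical fail-fast check for a pinned Py4J version; the pinned
# string here is illustrative, not an actual Spark pin. An exact match
# (rather than a range) is used because internal APIs are relied upon.
PINNED_PY4J = "0.10.4"

def py4j_version_ok(installed_version):
    """Return True only when the installed version matches the pin."""
    return installed_version == PINNED_PY4J

print(py4j_version_ok("0.10.4"))  # True
print(py4j_version_ok("0.10.6"))  # False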

Maciej Szymkiewicz
