spark-dev mailing list archives

From Maciej Szymkiewicz <mszymkiew...@gmail.com>
Subject Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?
Date Tue, 14 Feb 2017 15:19:15 GMT
I don't have any strong views, so just to highlight possible issues:

  * Judging by the issues I've seen, a substantial number of users
    depend on system-wide Python installations. As far as I am aware,
    neither Py4j nor cloudpickle is present in the standard system
    repositories of Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond
    its end of life, we have to be sure that any external dependency
    has the same policy.
  * Py4j is missing from the default Anaconda channel. Not a big
    issue, just a small annoyance.
  * External dependencies with pinned versions add some overhead to
    development across versions; effectively we may need a separate
    environment for each major Spark release (see the sketch after
    this list). I've seen small inconsistencies in PySpark behavior
    with different Py4j versions, so this is not completely
    hypothetical.
  * Possible version conflicts. This is probably not a big risk, but
    it is something to consider (for example, in the combination of
    Blaze + Dask + PySpark).
  * Adding another party the user has to trust.
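
To make the pinning overhead concrete, here is a minimal sketch of the
kind of import-time check an explicitly pinned dependency implies (the
package pins below are hypothetical, not actual Spark requirements):

    import pkg_resources

    # Hypothetical pins; each major Spark release would carry its own
    # set, hence the possible need for one environment per release.
    PINNED = {"py4j": "0.10.4", "cloudpickle": "0.2.2"}

    for name, required in PINNED.items():
        installed = pkg_resources.get_distribution(name).version
        if installed != required:
            raise ImportError("%s %s is installed, but %s is pinned"
                              % (name, installed, required))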


On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in
> 2016, and reviewing those updates is a bit involved (verifying the
> zip file contents is somewhat tedious).
>
> cloudpickle is a bit harder to tell, since we can have changes to
> our copy of cloudpickle which aren't correctly tagged as backports
> from the fork (and these can take a while to review, since we don't
> always catch them right away as being backports).
>
> Another difficulty with looking at backports is that, since our
> review process for PySpark has historically been on the slow side,
> changes benefiting systems like Dask or IPython parallel were not
> backported to Spark unless they caused serious errors.
>
> I think the key benefits are: better test coverage of the forked
> version of cloudpickle, more standardized packaging of dependencies,
> and simpler dependency updates, which reduce the friction of gaining
> the benefits of other related projects' work - Python serialization
> really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs, I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>     With any dependency update (or refactoring of existing code), I
>     always ask this question: what's the benefit? In this case it
>     looks like the benefit is reduced effort on backports. Do you
>     know how often we've needed to do those?
>
>
>     On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <holden@pigscanfly.ca> wrote:
>
>         Hi PySpark Developers,
>
>         Cloudpickle is a core part of PySpark, and was originally
>         copied from (and improved upon) PiCloud. Since then other
>         projects have found cloudpickle useful, and a fork of
>         cloudpickle <https://github.com/cloudpipe/cloudpickle> was
>         created and is now maintained as its own library
>         <https://pypi.python.org/pypi/cloudpickle> (with better test
>         coverage and resulting bug fixes, I understand). We've had a
>         few PRs backporting fixes from the cloudpickle project into
>         Spark's local copy of cloudpickle - how would people feel
>         about moving to an explicit (pinned) dependency on
>         cloudpickle?
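>
>         (For context: cloudpickle matters to PySpark because it can
>         serialize lambdas and closures, which the standard pickle
>         module rejects; a minimal illustration, assuming only that
>         the PyPI cloudpickle package is installed:)
>
>             import pickle
>             import cloudpickle
>
>             factor = 3
>             # plain pickle.dumps(f) raises PicklingError for a lambda
>             f = lambda x: x * factor
>             g = pickle.loads(cloudpickle.dumps(f))
>             assert g(2) == 6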
>
>         We could add cloudpickle to setup.py, and to a
>         requirements.txt file for users who prefer not to do a
>         system installation of PySpark.
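>
>         A rough sketch of the setup.py side (the version pins below
>         are hypothetical, not decided values):
>
>             # setup.py (sketch only; the real file carries far more
>             # metadata and packaging logic)
>             from setuptools import setup
>
>             setup(
>                 name="pyspark",
>                 version="2.2.0.dev0",  # hypothetical
>                 packages=["pyspark"],
>                 install_requires=[
>                     "cloudpickle==0.2.2",  # hypothetical pin
>                 ],
>             )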
>
>         Py4J is maybe an even simpler case: we currently have a zip
>         of Py4J in our repo, but could instead require a pinned
>         version. While we do depend on a lot of Py4J internal APIs,
>         version pinning should be sufficient to ensure functionality
>         (and would simplify the update process).
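>
>         Concretely, the requirements.txt could then pin both
>         dependencies (again, versions hypothetical):
>
>             # requirements.txt (sketch)
>             py4j==0.10.4
>             cloudpickle==0.2.2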
>
>         Cheers,
>
>         Holden :)
>
>         -- 
>         Twitter: https://twitter.com/holdenkarau
>
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-- 
Maciej Szymkiewicz

