spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ricardo Almeida <ricardo.alme...@actnowib.com>
Subject Re: Python Spark Improvements (forked from Spark Improvement Proposals)
Date Wed, 12 Oct 2016 21:09:02 GMT
I would add to the list the lag between Scala and Python for
new released features. Some features/functions get implemented later for
Pyspark, others not available at all. Think GraphX (maybe not the best
example), usually mentioned as one of the main libraries, that didn't make
it to the Python API (and never will - fortunately GraphFrames came to the
rescue on this particular case).

On 12 October 2016 at 21:49, Holden Karau <holden@pigscanfly.ca> wrote:

> Hi Spark Devs & Users,
>
>
> Forking off from Cody’s original thread
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html#none>
> of Spark Improvements, and Matei's follow up on asking what issues the
> Python community was facing with Spark, I think it would be useful for us
> to discuss some of the motivations behind some of the Python community
> looking at different technologies to replace Apache Spark with. My
> viewpoints are based that of a developer who works on Apache Spark
> day-to-day <http://bit.ly/hkspmg>, but also gives a fair number of talks
> at Python conferences
> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=holden+karau+pydata&start=0>
> and I feel many (but not all) of the same challenges as the Python
> community does trying to use Spark. I’ve included both the user@ and dev@
> lists on this one since I think the user community can probably provide
> more reasons why they have difficulty with PySpark. I should also point out
> - the solution for all of these things may not live inside of the Spark
> project itself, but it still impacts our usability as a whole.
>
>
>    -
>
>    Lack of pip installability
>
> This is one of the points that Matei mentioned, and it something several
> people have tried to provide for Spark in one way or another. It seems
> getting reviewer time for this issue is rather challenging, and I’ve been
> hesitant to ask the contributors to keep updating their PRs (as much as I
> want to see some progress) because I just don't know if we have the time or
> interest in this. I’m happy to pick up the latest from Juliet and try and
> carry it over the finish line if we can find some committer time to work on
> this since it now sounds like there is consensus we should do this.
>
>    -
>
>    Difficulty using PySpark from outside of spark-submit / pyspark shell
>
> The FindSpark <https://pypi.python.org/pypi/findspark> package needing to
> exist is one of the clearest examples of this challenge. There is also a PR
> to make it easier for other shells to extend the Spark shell, and we ran
> into some similar challenges while working on Sparkling Pandas. This could
> be solved by making Spark pip installable so I won't’ say too much about
> this point.
>
>    -
>
>    Minimal integration with IPython/IJupyter
>
> This one is awkward since one of the areas that some of the larger
> commercial players work in effectively “competes” (in a very loose term)
> with any features introduced around here. I’m not really super sure what
> the best path forward is here, but I think collaborating with the IJupyter
> people to enable more features found in the commercial offerings in open
> source could be beneficial to everyone in the community, and maybe even
> reduce the maintenance cost for some of the commercial entities. I
> understand this is a tricky issue but having good progress indicators or
> something similar could make a huge difference. (Note that Apache Toree
> <https://toree.incubator.apache.org/> [Incubating] exists for Scala users
> but hopefully the PySpark IJupyter integration could be achieved without a
> new kernel).
>
>    -
>
>    Lack of virtualenv and or Python package distribution support
>
> This one is also tricky since many commercial providers have their own
> “solution” to this, but there isn’t a good story around supporting custom
> virtual envs or user required Python packages. While spark-packages _can_
> be Python this requires that the Python package developer go through rather
> a lot of work to make their package available and realistically won’t
> happen for most Python packages people want to use. And to be fair, the
> addFiles mechanism does support Python eggs which works for some packages.
> There are some outstanding PRs around this issue (and I understand these
> are perhaps large issues which might require large changes to the current
> suggested implementations - I’ve had difficulty keeping the current set of
> open PRs around this straight in my own head) but there seems to be no
> committer bandwidth or interest on working with the contributors who have
> suggested these things. Is this an intentional decision or is this
> something we as a community are willing to work on/tackle?
>
>    -
>
>    Speed/performance
>
> This is often a complaint I hear from more “data engineering” profile
> users who are working in Python. These problems come mostly in places
> involving the interaction of Python and the JVM (so UDFs, transformations
> with arbitrary lambdas, collect() and toPandas()). This is an area I’m
> working on (see https://github.com/apache/spark/pull/13571 ) and
> hopefully we can start investigating Apache Arrow
> <https://arrow.apache.org/> to speed up the bridge (or something similar)
> once it’s a bit more ready (currently Arrow just released 0.1 which is
> exciting). We also probably need to start measuring these things more
> closely since otherwise random regressions will continue to be introduced
> (like the challenge with unbalanced partitions and block serialization
> together - see SPARK-17817
> <https://issues.apache.org/jira/browse/SPARK-17817> which fixed this)
>
>    -
>
>    Configuration difficulties (especially related to OOMs)
>
> This is a general challenge many people face working in Spark, but PySpark
> users are also asked to somehow figure out what the correct amount of
> memory is to give to the Python process versus the Scala/JVM processes.
> This was maybe an acceptable solution at the start, but when combined with
> the difficult to understand error messages it can become quite the time
> sink. A quick work around would be picking a different default overhead for
> applications using Python, but more generally hopefully some shared off-JVM
> heap solution could also help reduce this challenge in the future.
>
>    -
>
>    API difficulties
>
> The Spark API doesn’t “feel” very Pythony is a complaint some people have,
> but I think we’ve done some excellent work in the DataFrame/Dataset API
> here. At the same time we’ve made some really frustrating choices with the
> DataFrame API (e.g. removing map from DataFrames pre-emptively even when we
> have no concrete plans to bring the Dataset API to PySpark).
>
> A lot of users wish that our DataFrame API was more like the Pandas API
> (and Wes has pointed out on some JIRAs where we have differences) as well
> as covered more of the functionality of Pandas. This is a hard problem, and
> it the solution might not belong inside of PySpark itself (Juliet and I did
> some proof-of-concept work back in the day on Sparkling Pandas
> <https://github.com/sparklingpandas/sparklingpandas>) - but since one of
> my personal goals has been trying to become a committer I’ve been more
> focused on contributing to Spark itself rather than libraries and very few
> people seem to be interested in working on this project [although I still
> have potential users ask if they can use it]. (Of course if there is
> sufficient interest to reboot Sparkling Pandas or something similar that
> would be an interesting area of work - but it’s also a huge area of work -
> if you look at Dask <http://dask.pydata.org/>, a good portion of the work
> is dedicated just to supporting pandas like operations).
>
>    -
>
>    Incomprehensible error messages
>
> I often have people ask me how to debug PySpark and they often have a
> certain haunted look in their eyes while they ask me this (slightly
> joking). More seriously, we really need to provide more guidance around how
> to understand PySpark error messages and look at figuring out if there are
> places where we can improve the messaging so users aren’t hunting through
> stack overflow trying to figure out where the Java exception they are
> getting is related to their Python code. In one talk I gave recently
> someone mentioned PySpark was the motivation behind finding the hide error
> messages plugin/settings for IJupyter.
>
>    -
>
>    Lack of useful ML model & pipeline export/import
>
> This is something we’ve made great progress on, many of the PySpark models
> are now able to use the underlying export mechanisms from Java. However I
> often hear challenges with using these models in the rest of the Python
> space once they have been exported from Spark. I’ve got a PR to add basic
> PMML export in Scala to ML (which we can then bring to Python), but I think
> the Python community is open to other formats if the Spark community
> doesn’t want to go the PMML route.
>
> Now I don’t think we will see the same challenges we’ve seen develop in
> the R community, but I suspect purely Python approaches to distributed
> systems will continue to eat the “low end” of Spark (e.g. medium sized data
> problems requiring parallelism). This isn’t necessarily a bad thing, but if
> there is anything I’ve learnt it that's the "low end" solution often
> quickly eats the "high end" within a few years - and I’d rather see Spark
> continue to thrive outside of the pure JVM space.
>
> These are just the biggest issues that I hear come up commonly and
> remembered on my flight back - it’s quite possible I’ve missed important
> things. I know contributing on a mailing list can be scary or intimidating
> for new users (and even experienced developers may wish to stay out of
> discussions they view as heated) - but I strongly encourage everyone to
> participate (respectfully) in this thread and we can all work together to
> help Spark continue to be the place where people from different languages
> and backgrounds continue to come together to collaborate.
>
>
> I want to be clear as well, while I feel these are all very important
> issues (and being someone who has worked on PySpark & Spark for years
> <http://bit.ly/hkspmg> without being a committer I may sometimes come off
> as frustrated when I talk about these) I think PySpark as a whole is a
> really excellent application and we do some really awesome stuff with it.
> There are also things that I will be blind to as a result of having worked
> on Spark for so long (for example yesterday I caught myself using the _
> syntax in a Scala example without explaining it because it seems “normal”
> to me but often trips up new comers.) If we can address even some of these
> issues I believe it will be a huge win for Spark adoption outside of the
> traditional JVM space (and as the various community surveys continue to
> indicate PySpark usage is already quite high).
>
> Normally I’d bring in a box of timbits
> <https://en.wikipedia.org/wiki/Timbits>/doughnuts or something if we were
> having an in-person meeting for this but all I can do for the mailing list
> is attach a cute picture and offer future doughnuts/coffee if people want
> to chat IRL. So, in closing, I’ve included a picture of two of my stuffed
> animals working on Spark on my flight back from a Python Data conference &
> Spark meetup just to remind everyone that this is just a software project
> and we can be friendly nice people if we try and things will be much more
> awesome if we do :)
> [image: Inline image 1]
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Mime
View raw message