spark-user mailing list archives

From Holden Karau <>
Subject Re: Python Spark Improvements (forked from Spark Improvement Proposals)
Date Mon, 31 Oct 2016 17:28:28 GMT
I've been working on some of the issues and I have a PR to help provide
pip installable PySpark - .

Also, if anyone is interested in PySpark UDF performance, there is a closed
PR ( ) - if we could figure out a way to make it a little more
lightweight, I'd love to hear people's ideas.

On Sun, Oct 16, 2016 at 8:52 PM, Tobi Bosede <> wrote:

> Right, Jeff. Holden actually brought this thread up today on the user list
> in response to my email about UDAFs here: https://www.mail-archive.com/
> I have yet to hear what I can do as a workaround besides switching to
> Scala/Java, though...
> On Sun, Oct 16, 2016 at 10:07 PM, Jeff Zhang <> wrote:
>> Thanks Holden for starting this thread. I agree that Spark devs should put
>> more effort into PySpark, but the reality is that we have been a little
>> slow in this respect. Since PySpark exists to integrate Spark into Python,
>> I think the focus should be on the usability of PySpark, and we should
>> hear more feedback from the Python community. Lots of data scientists love
>> Python, so if we want more adoption of Spark, then we should spend more
>> time on PySpark.
>> On Thu, Oct 13, 2016 at 9:59 AM, Nicholas Chammas <
>>> wrote:
>> I'd add one item to this list: The lack of Python 3 support in Spark
>> Packages <>.
>> This means that great packages like GraphFrames cannot be used with
>> Python 3 <>.
>> This is quite disappointing since Spark itself supports Python 3 and
>> since -- at least in my circles -- Python 3 adoption is reaching a tipping
>> point. All new Python projects at my company and at my friends' companies
>> are being written in Python 3.
>> Nick
>> On Wed, Oct 12, 2016 at 3:52 PM Holden Karau <>
>> wrote:
>> Hi Spark Devs & Users,
>> Forking off from Cody’s original thread
>> <>
>> on Spark Improvement Proposals, and Matei's follow-up asking what issues the
>> Python community was facing with Spark, I think it would be useful for us
>> to discuss some of the motivations behind parts of the Python community
>> looking at different technologies to replace Apache Spark. My
>> viewpoint is that of a developer who works on Apache Spark
>> day-to-day <>, but who also gives a fair number of talks
>> at Python conferences
>> <>
>> and I face many (but not all) of the same challenges that the Python
>> community does when trying to use Spark. I’ve included both the user@ and dev@
>> lists on this one since I think the user community can probably provide
>> more reasons why they have difficulty with PySpark. I should also point out
>> - the solution for all of these things may not live inside of the Spark
>> project itself, but it still impacts our usability as a whole.
>>    -
>>    Lack of pip installability
>> This is one of the points that Matei mentioned, and it is something
>> several people have tried to provide for Spark in one way or another. It seems
>> getting reviewer time for this issue is rather challenging, and I’ve been
>> hesitant to ask the contributors to keep updating their PRs (as much as I
>> want to see some progress) because I just don't know if we have the time or
>> interest in this. I’m happy to pick up the latest from Juliet and try and
>> carry it over the finish line if we can find some committer time to work on
>> this since it now sounds like there is consensus we should do this.
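If that work lands, the hoped-for workflow might look something like the
sketch below (an assumption on my part - the package name and what gets
published are up to whichever PR finally merges):

```shell
# Hypothetical once PySpark is published to PyPI (assuming the package
# would simply be named "pyspark"):
pip install pyspark

# PySpark would then be importable like any other library, with no
# SPARK_HOME setup or spark-submit wrapper required for local use:
python -c "import pyspark; print(pyspark.__version__)"
```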
>>    -
>>    Difficulty using PySpark from outside of spark-submit / pyspark shell
>> The FindSpark <> package needing
>> to exist is one of the clearest examples of this challenge. There is also a
>> PR to make it easier for other shells to extend the Spark shell, and we ran
>> into some similar challenges while working on Sparkling Pandas. This could
>> be solved by making Spark pip installable, so I won’t say too much about
>> this point.
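For reference, what FindSpark does for users boils down to a few lines of
path manipulation - roughly the following (a simplified sketch, not the
package's actual code; the /opt/spark fallback is just an illustration):

```python
import glob
import os
import sys

def init_spark(spark_home=None):
    """Roughly what findspark.init() does: locate a Spark install and
    put its Python bindings on sys.path so that `import pyspark` works
    from a plain Python interpreter or notebook."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/opt/spark")
    python_dir = os.path.join(spark_home, "python")
    # PySpark talks to the JVM through py4j, which ships as a zip
    # under python/lib in the Spark distribution.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    sys.path[:0] = [python_dir] + py4j_zips
    return python_dir
```

The point is less the mechanics than that a separate package has to exist
at all; pip installability would make this dance unnecessary.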
>>    -
>>    Minimal integration with IPython/IJupyter
>> This one is awkward since one of the areas that some of the larger
>> commercial players work in effectively “competes” (in a very loose term)
>> with any features introduced around here. I’m not really super sure what
>> the best path forward is here, but I think collaborating with the IJupyter
>> people to enable more features found in the commercial offerings in open
>> source could be beneficial to everyone in the community, and maybe even
>> reduce the maintenance cost for some of the commercial entities. I
>> understand this is a tricky issue but having good progress indicators or
>> something similar could make a huge difference. (Note that Apache Toree
>> <> [Incubating] exists for Scala
>> users but hopefully the PySpark IJupyter integration could be achieved
>> without a new kernel).
>>    -
>>    Lack of virtualenv and or Python package distribution support
>> This one is also tricky since many commercial providers have their own
>> “solution” to this, but there isn’t a good story around supporting custom
>> virtual envs or user-required Python packages. While spark-packages _can_
>> be Python, this requires the Python package developer to go through rather
>> a lot of work to make their package available, and realistically that won’t
>> happen for most Python packages people want to use. And to be fair, the
>> addFiles mechanism does support Python eggs which works for some packages.
>> There are some outstanding PRs around this issue (and I understand these
>> are perhaps large issues which might require large changes to the current
>> suggested implementations - I’ve had difficulty keeping the current set of
>> open PRs around this straight in my own head) but there seems to be no
>> committer bandwidth or interest on working with the contributors who have
>> suggested these things. Is this an intentional decision or is this
>> something we as a community are willing to work on/tackle?
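Today the closest thing to a story here is shipping dependencies yourself,
along the lines of this sketch (file and package names are illustrative;
it works for pure-Python packages but falls over for anything with native
extensions):

```shell
# Bundle pure-Python dependencies into a zip (or egg)...
cd deps && zip -r ../deps.zip mypackage/ && cd ..

# ...and ship it to the executors at submit time:
spark-submit --py-files deps.zip my_job.py

# Or at runtime from an existing SparkContext:
#   sc.addPyFile("deps.zip")
```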
>>    -
>>    Speed/performance
>> This is often a complaint I hear from more “data engineering” profile
>> users who are working in Python. These problems come mostly in places
>> involving the interaction of Python and the JVM (so UDFs, transformations
>> with arbitrary lambdas, collect() and toPandas()). This is an area I’m
>> working on (see ) and
>> hopefully we can start investigating Apache Arrow
>> <> to speed up the bridge (or something
>> similar) once it’s a bit more ready (currently Arrow just released 0.1
>> which is exciting). We also probably need to start measuring these things
>> more closely since otherwise random regressions will continue to be
>> introduced (like the challenge with unbalanced partitions and block
>> serialization together - see SPARK-17817
>> <> which fixed this)
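As a concrete illustration of where that Python/JVM boundary cost shows up
(a sketch - it needs a live SparkSession to actually run, and the data and
column names are made up):

```python
# A Python UDF forces every row to be serialized from the JVM out to a
# Python worker process and back, while the equivalent built-in
# expression runs entirely inside the JVM.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-cost-sketch").getOrCreate()
df = spark.createDataFrame([("spark",), ("arrow",)], ["word"])

# Python UDF: pays the Python<->JVM serialization cost per row.
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(upper_udf("word")).show()

# Built-in function: stays in the JVM, no per-row round trip.
df.select(F.upper("word")).show()
```

Something like Arrow would attack the serialization half of that cost
rather than the per-row Python function call itself.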
>>    -
>>    Configuration difficulties (especially related to OOMs)
>> This is a general challenge many people face working in Spark, but
>> PySpark users are also asked to somehow figure out what the correct amount
>> of memory is to give to the Python process versus the Scala/JVM processes.
>> This was maybe an acceptable solution at the start, but combined with
>> the difficult-to-understand error messages it can become quite the time
>> sink. A quick workaround would be picking a different default overhead for
>> applications using Python, but more generally hopefully some shared off-JVM
>> heap solution could also help reduce this challenge in the future.
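For concreteness, the knob in question today is
spark.yarn.executor.memoryOverhead, whose current default is roughly the
heuristic sketched below; the Python worker processes have to fit inside
this overhead, which is why PySpark jobs hit container OOM kills under
defaults tuned for JVM-only workloads (a Python-aware default would bump
the fraction):

```python
# Sketch of Spark's current YARN overhead default: the off-heap
# allowance is max(384 MB, 10% of executor memory).
MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10

def default_overhead_mb(executor_memory_mb):
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))

# With an 8 GB executor, only ~819 MB is left for the Python workers,
# shared across all cores on that executor:
print(default_overhead_mb(8192))   # 819
print(default_overhead_mb(2048))   # 384 (the floor applies)
```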
>>    -
>>    API difficulties
>> That the Spark API doesn’t “feel” very Pythonic is a complaint some people
>> have, but I think we’ve done some excellent work in the DataFrame/Dataset
>> API here. At the same time we’ve made some really frustrating choices with
>> the DataFrame API (e.g. removing map from DataFrames pre-emptively even
>> when we have no concrete plans to bring the Dataset API to PySpark).
>> A lot of users wish that our DataFrame API was more like the Pandas API
>> (and Wes has pointed out on some JIRAs where we have differences) as well
>> as covered more of the functionality of Pandas. This is a hard problem, and
>> the solution might not belong inside of PySpark itself (Juliet and I did
>> some proof-of-concept work back in the day on Sparkling Pandas
>> <>) - but since one of
>> my personal goals has been trying to become a committer I’ve been more
>> focused on contributing to Spark itself rather than libraries and very few
>> people seem to be interested in working on this project [although I still
>> have potential users ask if they can use it]. (Of course if there is
>> sufficient interest to reboot Sparkling Pandas or something similar that
>> would be an interesting area of work - but it’s also a huge area of work -
>> if you look at Dask <>, a good portion of the
>> work is dedicated just to supporting pandas like operations).
>>    -
>>    Incomprehensible error messages
>> I often have people ask me how to debug PySpark and they often have a
>> certain haunted look in their eyes while they ask me this (slightly
>> joking). More seriously, we really need to provide more guidance around how
>> to understand PySpark error messages and look at figuring out if there are
>> places where we can improve the messaging so users aren’t hunting through
>> stack overflow trying to figure out where the Java exception they are
>> getting is related to their Python code. In one talk I gave recently
>> someone mentioned PySpark was the motivation behind finding the hide error
>> messages plugin/settings for IJupyter.
>>    -
>>    Lack of useful ML model & pipeline export/import
>> This is something we’ve made great progress on, many of the PySpark
>> models are now able to use the underlying export mechanisms from Java.
>> However I often hear challenges with using these models in the rest of the
>> Python space once they have been exported from Spark. I’ve got a PR to add
>> basic PMML export in Scala to ML (which we can then bring to Python), but I
>> think the Python community is open to other formats if the Spark community
>> doesn’t want to go the PMML route.
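To make the current state concrete (a sketch needing a live SparkSession
and a training DataFrame; the paths and parameters are illustrative):

```python
# PySpark models can now round-trip through Spark's native save/load,
# but the on-disk format is Spark-internal - scikit-learn and friends
# can't read it, which is where a PMML (or similar) export would help.
from pyspark.ml.classification import (
    LogisticRegression, LogisticRegressionModel)

lr = LogisticRegression(maxIter=10)
model = lr.fit(training_df)          # training_df: an assumed DataFrame

model.save("/tmp/lr-model")          # Spark-internal format only
same_model = LogisticRegressionModel.load("/tmp/lr-model")
```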
>> Now I don’t think we will see the same challenges we’ve seen develop in
>> the R community, but I suspect purely Python approaches to distributed
>> systems will continue to eat the “low end” of Spark (e.g. medium sized data
>> problems requiring parallelism). This isn’t necessarily a bad thing, but if
>> there is anything I’ve learnt, it’s that the "low end" solution often
>> quickly eats the "high end" within a few years - and I’d rather see Spark
>> continue to thrive outside of the pure JVM space.
>> These are just the biggest issues that I hear come up commonly and that I
>> remembered on my flight back - it’s quite possible I’ve missed important
>> things. I know contributing on a mailing list can be scary or intimidating
>> for new users (and even experienced developers may wish to stay out of
>> discussions they view as heated) - but I strongly encourage everyone to
>> participate (respectfully) in this thread and we can all work together to
>> help Spark continue to be the place where people from different languages
>> and backgrounds continue to come together to collaborate.
>> I want to be clear as well, while I feel these are all very important
>> issues (and being someone who has worked on PySpark & Spark for years
>> <> without being a committer I may sometimes come
>> off as frustrated when I talk about these) I think PySpark as a whole is a
>> really excellent application and we do some really awesome stuff with it.
>> There are also things that I will be blind to as a result of having worked
>> on Spark for so long (for example yesterday I caught myself using the _
>> syntax in a Scala example without explaining it because it seems “normal”
>> to me but often trips up newcomers.) If we can address even some of these
>> issues I believe it will be a huge win for Spark adoption outside of the
>> traditional JVM space (and as the various community surveys continue to
>> indicate PySpark usage is already quite high).
>> Normally I’d bring in a box of timbits
>> <>/doughnuts or something if we
>> were having an in-person meeting for this but all I can do for the mailing
>> list is attach a cute picture and offer future doughnuts/coffee if people
>> want to chat IRL. So, in closing, I’ve included a picture of two of my
>> stuffed animals working on Spark on my flight back from a Python Data
>> conference & Spark meetup just to remind everyone that this is just a
>> software project and we can be friendly nice people if we try and things
>> will be much more awesome if we do :)
>> [image: image.png]
>> --
>> Cell : 425-233-8271
>> Twitter:
>> --
>> Best Regards
>> Jeff Zhang

Cell : 425-233-8271
