spark-dev mailing list archives

From Stavros Kontopoulos <stavros.kontopou...@lightbend.com>
Subject Re: Python friendly API for Spark 3.0
Date Sat, 29 Sep 2018 21:17:13 GMT
Regarding the Python 3.x upgrade referenced earlier: some people have already
gone down that path of upgrading:

https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever

They describe some good reasons.

Stavros

On Tue, Sep 18, 2018 at 6:35 PM, Erik Erlandson <eerlands@redhat.com> wrote:

> I like the notion of empowering cross platform bindings.
>
> The trend of computing frameworks seems to be that all APIs gradually
> converge on a stable attractor which could be described as "data frames and
> SQL". Spark's early API design was RDD focused, but these days the center
> of gravity is all about DataFrame (Python's prevalence combined with its
> lack of a static type system substantially dilutes the benefits of DataSet,
> for any library development that aspires to both JVM and python support).
>
> I can imagine optimizing the developer layers of Spark APIs so that cross
> platform support and also 3rd-party support for new and existing Spark
> bindings would be maximized for "parallelizable dataframe+SQL". Another of
> Spark's strengths is its ability to federate heterogeneous data sources,
> and making cross platform bindings easy for that is desirable.
>
>
> On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra <mark@clearstorydata.com>
> wrote:
>
>> It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread.) What
>> I've been thinking about also concerns API "friendliness" or style. The
>> original RDD API was very intentionally modeled on the Scala parallel
>> collections API. That made it quite friendly for some Scala programmers,
>> but not as much so for users of the other language APIs when they
>> eventually came about. Similarly, the Dataframe API drew a lot from pandas
>> and R, so it is relatively friendly for those used to those abstractions.
>> Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL.
>> The new barrier scheduling draws inspiration from MPI. With all of these
>> models and sources of inspiration, as well as multiple language targets,
>> there isn't really a strong sense of coherence across Spark -- I mean, even
>> though one of the key advantages of Spark is the ability to do within a
>> single framework things that would otherwise require multiple frameworks,
>> actually doing that requires more programming styles or design
>> abstractions than is strictly necessary, even when writing Spark code in
>> just a single language.
>>
>> For me, that raises questions over whether we want to start designing,
>> implementing and supporting APIs that are designed to be more consistent,
>> friendly and idiomatic to particular languages and abstractions -- e.g. an
>> API covering all of Spark that is designed to look and feel as much like
>> "normal" code for a Python programmer, another that looks and feels more
>> like "normal" Java code, another for Scala, etc. That's a lot more work and
>> support burden than the current approach where sometimes it feels like you
>> are writing "normal" code for your preferred programming environment, and
>> sometimes it feels like you are trying to interface with something foreign,
>> but underneath it hopefully isn't too hard for those writing the
>> implementation code below the APIs, and it is not too hard to maintain
>> multiple language bindings that are each fairly lightweight.
>>
>> It's a cost-benefit judgement, of course, whether APIs that are heavier
>> (in terms of implementing and maintaining) and friendlier (for end users)
>> are worth doing, and maybe some of these "friendlier" APIs can be done
>> outside of Spark itself (imo, Frameless is doing a very nice job for the
>> parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is a
>> bit too ad hoc and fragmentary for my taste.
>>
>> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson <eerlands@redhat.com>
>> wrote:
>>
>>> I am probably splitting hairs too finely, but I was considering the
>>> difference between improvements to the jvm-side (py4j and the scala/java
>>> code) that would make it easier to write the python layer ("python-friendly
>>> api"), and actual improvements to the python layers ("friendly python api").
>>>
>>> They're not mutually exclusive of course, and both worth working on. But
>>> it's *possible* to improve either without the other.
>>>
>>> Stub files look like a great solution for type annotations, maybe even
>>> if only python 3 is supported.
>>>
>>> I definitely agree that any decision to drop python 2 should not be
>>> taken lightly. Anecdotally, I'm seeing an increase in python developers
>>> announcing that they are dropping support for python 2 (and loving it). As
>>> people have already pointed out, if we don't drop python 2 for spark 3.0,
>>> we're stuck with it until 4.0, which would place spark in a
>>> possibly-awkward position of supporting python 2 for some time after it
>>> goes EOL.
>>>
>>> Under the current release cadence, spark 3.0 will land some time in
>>> early 2019, which at that point will be mere months until EOL for py2.
>>>
>>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson <eerlands@redhat.com>
>>>> wrote:
>>>>
>>>>> To be clear, is this about "python-friendly API" or "friendly python
>>>>> API" ?
>>>>>
>>>> Well what would you consider to be different between those two
>>>> statements? I think it would be good to be a bit more explicit, but I don't
>>>> think we should necessarily limit ourselves.
>>>>
>>>>>
>>>>> On the python side, it might be nice to take advantage of static
>>>>> typing. Requires python 3.6 but with python 2 going EOL, a spark-3.0
>>>>> might be a good opportunity to jump the python-3-only train.
>>>>>
>>>> I think we can make types sort of work without ditching 2 (the types
>>>> only would work in 3 but it would still function in 2). Ditching 2 entirely
>>>> would be a big thing to consider, I honestly hadn't been considering that
>>>> but it could be from just spending so much time maintaining a 2/3 code
>>>> base. I'd suggest reaching out to user@ before making that kind of
>>>> change.
>>>>
>>>>>
>>>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau <holden@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Since we're talking about Spark 3.0 in the near future (and since
>>>>>> some recent conversation on a proposed change reminded me) I wanted
>>>>>> to open up the floor and see if folks have any ideas on how we could
>>>>>> make a more Python friendly API for 3.0? I'm planning on taking some
>>>>>> time to look at other systems in the solution space and see what we
>>>>>> might want to learn from them but I'd love to hear what other folks
>>>>>> are thinking too.
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>
>
