spark-dev mailing list archives

From Erik Erlandson <>
Subject Re: Python friendly API for Spark 3.0
Date Tue, 18 Sep 2018 15:35:19 GMT
I like the notion of empowering cross platform bindings.

The trend of computing frameworks seems to be that all APIs gradually
converge on a stable attractor which could be described as "data frames and
SQL". Spark's early API design was RDD-focused, but these days the center
of gravity is all about DataFrame (Python's prevalence combined with its
lack of a static type system substantially dilutes the benefits of DataSet,
for any library development that aspires to both JVM and python support).

I can imagine optimizing the developer layers of Spark APIs so that cross
platform support and also 3rd-party support for new and existing Spark
bindings would be maximized for "parallelizable dataframe+SQL". Another of
Spark's strengths is its ability to federate heterogeneous data sources,
and making cross platform bindings easy for that is desirable.

On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra <> wrote:

> It's not splitting hairs, Erik. It's actually very close to something that
> I think deserves some discussion (perhaps on a separate thread.) What I've
> been thinking about also concerns API "friendliness" or style. The original
> RDD API was very intentionally modeled on the Scala parallel collections
> API. That made it quite friendly for some Scala programmers, but not as
> much so for users of the other language APIs when they eventually came
> about. Similarly, the DataFrame API drew a lot from pandas and R, so it is
> relatively friendly for those used to those abstractions. Of course, the
> Spark SQL API is modeled closely on HiveQL and standard SQL. The new
> barrier scheduling draws inspiration from MPI. With all of these models and
> sources of inspiration, as well as multiple language targets, there isn't
> really a strong sense of coherence across Spark -- I mean, even though one
> of the key advantages of Spark is the ability to do within a single
> framework things that would otherwise require multiple frameworks, actually
> doing that requires more programming styles and design abstractions than
> are strictly necessary, even when writing Spark code in just a single
> language.
> For me, that raises questions over whether we want to start designing,
> implementing and supporting APIs that are designed to be more consistent,
> friendly and idiomatic to particular languages and abstractions -- e.g. an
> API covering all of Spark that is designed to look and feel as much like
> "normal" code for a Python programmer, another that looks and feels more
> like "normal" Java code, another for Scala, etc. That's a lot more work and
> support burden than the current approach where sometimes it feels like you
> are writing "normal" code for your preferred programming environment, and
> sometimes it feels like you are trying to interface with something foreign,
> but underneath it hopefully isn't too hard for those writing the
> implementation code below the APIs, and it is not too hard to maintain
> multiple language bindings that are each fairly lightweight.
> It's a cost-benefit judgement, of course, whether APIs that are heavier
> (in terms of implementing and maintaining) and friendlier (for end users)
> are worth doing, and maybe some of these "friendlier" APIs can be done
> outside of Spark itself (imo, Frameless is doing a very nice job for the
> parts of Spark that it is currently covering --
> typelevel/frameless); but what we have currently is a bit too ad hoc and
> fragmentary for my taste.
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson <>
> wrote:
>> I am probably splitting hairs too finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>> They're not mutually exclusive of course, and both worth working on. But
>> it's *possible* to improve either without the other.
>> Stub files look like a great solution for type annotations, maybe even if
>> only python 3 is supported.
>> I definitely agree that any decision to drop python 2 should not be taken
>> lightly. Anecdotally, I'm seeing an increase in python developers
>> announcing that they are dropping support for python 2 (and loving it). As
>> people have already pointed out, if we don't drop python 2 for spark 3.0,
>> we're stuck with it until 4.0, which would place spark in a
>> possibly-awkward position of supporting python 2 for some time after it
>> goes EOL.
>> Under the current release cadence, spark 3.0 will land some time in early
>> 2019, at which point py2 will be mere months from EOL.
>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau <>
>> wrote:
>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson <>
>>> wrote:
>>>> To be clear, is this about "python-friendly API" or "friendly python
>>>> API" ?
>>> Well what would you consider to be different between those two
>>> statements? I think it would be good to be a bit more explicit, but I don't
>>> think we should necessarily limit ourselves.
>>>> On the python side, it might be nice to take advantage of static
>>>> typing. Requires python 3.6 but with python 2 going EOL, a spark-3.0 might
>>>> be a good opportunity to jump on the python-3-only train.
>>> I think we can make types sort of work without ditching 2 (the types
>>> only would work in 3 but it would still function in 2). Ditching 2 entirely
>>> would be a big thing to consider, I honestly hadn't been considering that
>>> but it could be from just spending so much time maintaining a 2/3 code
>>> base. I'd suggest reaching out to user@ before making that kind of
>>> change.
>>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau <>
>>>> wrote:
>>>>> Since we're talking about Spark 3.0 in the near future (and since some
>>>>> recent conversation on a proposed change reminded me) I wanted to open
>>>>> the floor and see if folks have any ideas on how we could make a more
>>>>> Python friendly API for 3.0? I'm planning on taking some time to look at
>>>>> other systems in the solution space and see what we might want to learn
>>>>> from them but I'd love to hear what other folks are thinking too.
