spark-dev mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: [PySpark] Revisiting PySpark type annotations
Date Tue, 04 Aug 2020 16:44:35 GMT
What would be the reason for separate git repo?

________________________________
From: Hyukjin Kwon <gurwls223@gmail.com>
Sent: Monday, August 3, 2020 1:58:55 AM
To: Maciej Szymkiewicz <mszymkiewicz@gmail.com>
Cc: Driesprong, Fokko <fokko@driesprong.frl>; Holden Karau <holden@pigscanfly.ca>;
Spark Dev List <dev@spark.apache.org>
Subject: Re: [PySpark] Revisiting PySpark type annotations

Okay, it seems we can create a separate repo under apache, like apache/spark? e.g. https://issues.apache.org/jira/browse/INFRA-20470
We can also think about porting the files as they are.
I will try to have a short sync with the author, Maciej, and share what we discussed offline.


On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz <mszymkiewicz@gmail.com<mailto:mszymkiewicz@gmail.com>>
wrote:


On Wednesday, July 22, 2020, Driesprong, Fokko <fokko@driesprong.frl> wrote:
That's probably a one-time overhead, so it is not a big issue.  In my opinion, a bigger one is
the possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase.
This can be addressed, but it doesn't look great.

This is not true (anymore). With Python 3.6 you can use string annotations -> 'DenseVector',
and from Python 3.7 onwards this is fixed by postponed evaluation: https://www.python.org/dev/peps/pep-0563/
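As a rough illustration of the point above (the DenseVector here is a toy stand-in for pyspark.ml.linalg.DenseVector, not the real class): with PEP 563's postponed evaluation, a method can reference its own enclosing class by bare name, where Python 3.6 would have required the string form 'DenseVector'.

```python
from __future__ import annotations  # PEP 563 (Python 3.7+): postpone annotation evaluation

class DenseVector:
    """Toy stand-in for pyspark.ml.linalg.DenseVector (illustration only)."""

    def __init__(self, values: list) -> None:
        self.values = list(values)

    # Without the __future__ import, on Python 3.6 this return type would
    # have to be written as the string 'DenseVector'; with postponed
    # evaluation the bare name is legal even though the class body is
    # still being defined.
    def scale(self, factor: float) -> DenseVector:
        return DenseVector([v * factor for v in self.values])
```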

As far as I recall, the linked PEP addresses forward references, not cyclic dependencies, which weren't
a big issue in the first place.

What I mean is actual cyclic dependencies - for example pyspark.context depends on pyspark.rdd
and the other way around. These dependencies are not explicit at the moment.
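A common way to break such an annotation-induced cycle (a sketch, not how pyspark actually does it; ContextSketch is a hypothetical stand-in for pyspark.context.SparkContext) is typing.TYPE_CHECKING: the import is seen only by static type checkers and never executed at runtime, so no circular import occurs.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only during static analysis, never at runtime, so it
    # cannot create a circular import between the two modules.
    from pyspark.rdd import RDD

class ContextSketch:
    """Hypothetical stand-in for pyspark.context.SparkContext."""

    def parallelize(self, data: list) -> "RDD":
        # String annotation: 'RDD' is resolved by the type checker,
        # not at runtime.
        ...
```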


Merging stubs into the project structure, on the other hand, has almost no overhead.

This feels awkward to me; it is like having the docstring in a separate file. In my opinion
you want to have the signatures and the functions together for transparency and maintainability.


I guess that's a matter of preference. From a maintainability perspective it is actually much
easier to have separate objects.

For example, there are different types of objects that are required for meaningful checking,
which don't really exist in the real code (protocols, aliases, code-generated signatures for
complex overloads), as well as some monkey-patched entities.
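To make the kinds of stub-only constructs mentioned above concrete (all names here are hypothetical illustrations; Column is a minimal stand-in for pyspark.sql.Column, and col merely mimics the shape of pyspark.sql.functions.col): a structural Protocol, a type alias, and an overloaded signature, none of which exist as such in the runtime code.

```python
from typing import Protocol, Union, overload

class Column:
    """Minimal stand-in for pyspark.sql.Column (illustration only)."""
    def __init__(self, name: str) -> None:
        self.name = name

# Alias of the kind common in stubs but absent from the runtime code.
ColumnOrName = Union[Column, str]

# Structural Protocol: exists purely for type checking; no runtime
# class in the real codebase corresponds to it.
class SupportsName(Protocol):
    name: str

@overload
def col(name: str) -> Column: ...
@overload
def col(name: Column) -> Column: ...
def col(name: ColumnOrName) -> Column:
    # Single runtime implementation behind the two overloads above.
    return name if isinstance(name, Column) else Column(name)
```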

Additionally it is often easier to see inconsistencies when typing is separate.

However, I am not implying that this should be a persistent state.

In general I see two non breaking paths here.

- Merge pyspark-stubs as a separate subproject within the main Spark repo, keep it in sync there
with a common CI pipeline, and transfer ownership of the PyPI package to the ASF.
- Move stubs directly into python/pyspark and then apply individual stubs to modules of choice.

Of course, the first proposal could be an initial step for the latter one.


I think DBT is a very nice project where they use annotations very well: https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstrings, since they are available in the annotations
themselves.


In practice, the biggest advantage is actually support for completion, not type checking (which
works in simple cases).

Agreed.

Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and I am also happy to mentor. But I think we first need to agree as a Spark community
whether we want to add the annotations to the code, and to what extent.




At some point (in general when things are heavy in generics, which is the case here), annotations
become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the function/code :)

That might be the case, but it is more often a matter of capturing useful properties, combined with
the requirement to keep things in sync with their Scala counterparts.


For now, I tend to think adding type hints to the code makes it difficult to backport or revert,
and more difficult to discuss typing on its own, especially considering that typing is arguably
still premature.

This feels a bit weird to me, since you want to keep this in sync, right? Do you provide different
stubs for different versions of Python? I had to look up the literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions.


Cheers, Fokko

On Wed, Jul 22, 2020 at 09:40, Maciej Szymkiewicz <mszymkiewicz@gmail.com<mailto:mszymkiewicz@gmail.com>> wrote:

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the code makes it
> difficult to backport or revert, and
> more difficult to discuss typing on its own, especially considering
> typing is arguably still premature.

About being premature ‒ since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and supporting new features (for example, Literals proved to be very
useful) without waiting for the next Spark release.
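As a small, hypothetical example of the Literals mentioned above (PEP 586; join_describe and JoinType are invented names, not Spark API): a Literal type lets a checker restrict a string parameter to an enumerated set of values, which is exactly the kind of newer typing feature that decoupled stubs could adopt before Spark's minimum supported Python catches up.

```python
from typing import Literal

# PEP 586: the checker accepts only these exact string values.
JoinType = Literal["inner", "left", "right", "full"]

def join_describe(how: JoinType) -> str:
    # A type checker flags join_describe("lftt") as an error;
    # at runtime the argument is just an ordinary str.
    return "{} join".format(how)
```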

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




--

Best regards,
Maciej Szymkiewicz
