spark-dev mailing list archives

From Maciej Szymkiewicz <mszymkiew...@gmail.com>
Subject Re: [PySpark] Revisiting PySpark type annotations
Date Tue, 04 Aug 2020 19:32:06 GMT
*First of all, why ASF ownership?*

For a project of this size, maintaining high-quality annotations
independently of the actual codebase is far from trivial (it is not hard
to use stubgen or monkeytype, but the resulting annotations are rather
simplistic). For starters, changes which are mostly transparent to the
final user (like the pyspark.ml changes in 3.0 / 3.1) might require
significant changes in the annotations. Additionally, some signature
changes are rather hard to track, and such separation can easily lead to
divergence.

Additionally, annotations are as much about describing facts as about
showing intended usage (the simplest use case is documenting argument
dependencies). This makes the process of annotation rather subjective
and requires a good understanding of the author's intention.
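
To illustrate what documenting argument dependencies can look like, here
is a minimal sketch using typing.overload. The function fill_missing and
its parameters are hypothetical, made up for this example, and are not
part of PySpark.

```python
from typing import Dict, List, Optional, overload

Row = Dict[str, Optional[int]]


# Hypothetical function, not a PySpark API: the overloads state that
# `subset` is only meaningful when `value` is a scalar.
@overload
def fill_missing(
    rows: List[Row], value: int, subset: Optional[List[str]] = ...
) -> List[Row]: ...
@overload
def fill_missing(rows: List[Row], value: Dict[str, int]) -> List[Row]: ...


def fill_missing(rows, value, subset=None):
    """Replace None entries; `value` is a scalar or a per-key mapping."""
    if isinstance(value, dict):
        if subset is not None:
            raise TypeError("subset cannot be combined with a dict value")
        replacements = value
    else:
        # A scalar applies to the listed keys, or to every key if none given.
        keys = subset if subset is not None else {k for row in rows for k in row}
        replacements = {key: value for key in keys}
    return [
        {k: replacements.get(k, v) if v is None else v for k, v in row.items()}
        for row in rows
    ]
```

The runtime signature alone would accept a dict value together with
subset; with the overloads in place, a type checker such as mypy should
reject that combination because no overload matches it.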

Finally, annotation-friendly signatures require conscious decisions (see
for example https://github.com/python/mypy/issues/5621).

Overall, ASF ownership is probably the best way to ensure long-term
sustainability and quality of annotations.

*Now, why separate repo?*

Based on the discussion so far it is clear that there is no consensus
about using inline annotations. There are three other options:

  * Stub files packaged alongside actual code.
  * Separate project within root, packaged separately.
  * Separate repository, packaged separately.

As already pointed out here and in the comments to
https://github.com/apache/spark/pull/29180, annotations are still
somewhat unstable. The ecosystem evolves quickly, and new features keep
arriving, some with the potential to fundamentally change the way we
annotate code.

Therefore, it might be beneficial to maintain a subproject (for lack of
a better word) that can evolve faster than the code that is annotated.

While I have no strong opinion about this part, it is definitely a
relatively unobtrusive way of bringing code and annotations closer
together.

On 8/4/20 7:44 PM, Sean Owen wrote:

> Maybe more specifically, why an ASF repo?
>
> On Tue, Aug 4, 2020 at 11:45 AM Felix Cheung <felixcheung_m@hotmail.com> wrote:
>> What would be the reason for separate git repo?
>>
>> ________________________________
>> From: Hyukjin Kwon <gurwls223@gmail.com>
>> Sent: Monday, August 3, 2020 1:58:55 AM
>> To: Maciej Szymkiewicz <mszymkiewicz@gmail.com>
>> Cc: Driesprong, Fokko <fokko@driesprong.frl>; Holden Karau <holden@pigscanfly.ca>; Spark Dev List <dev@spark.apache.org>
>> Subject: Re: [PySpark] Revisiting PySpark type annotations
>>
>> Okay, seems like we can create a separate repo as apache/spark? e.g.) https://issues.apache.org/jira/browse/INFRA-20470
>> We can also think about porting the files as are.
>> I will try to have a short sync with the author Maciej, and share what we discussed offline.
>>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC

