spark-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: [PySpark] Revisiting PySpark type annotations
Date Thu, 20 Aug 2020 10:14:25 GMT
Hi Maciej, Hyukjin,

Did you find any time to discuss adding the types to the Python repository?
Would love to know what came out of it.

Cheers, Fokko

On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko <fokko@driesprong.frl> wrote:

> Mostly echoing stuff that we've discussed in
> https://github.com/apache/spark/pull/29180, but good to have this also on
> the dev-list.
>
> > So IMO maintaining outside in a separate repo is going to be harder.
> That was why I asked.
>
> I agree with Felix: having this inside the project would make it much
> easier to maintain. Having it inside the ASF might also make it easier to
> port the .pyi files to the actual Spark repository.
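>
> As a minimal, hypothetical sketch of what that separation looks like (the
> class and function below are illustrative, not the actual pyspark-stubs
> content): a .pyi stub declares signatures with bodies elided, while inline
> annotations in the .py source carry the same information.

```python
# Illustrative sketch only: not the actual pyspark-stubs content.
# A .pyi stub file mirrors the module's public names but elides bodies:
#
#     class DataFrame:
#         def limit(self, num: int) -> "DataFrame": ...
#
# Inline annotations in the .py source carry the same information and
# can be verified with a checker such as mypy:
from typing import List


def column_lengths(names: List[str]) -> List[int]:
    """Return the length of each (hypothetical) column name."""
    return [len(n) for n in names]


print(column_lengths(["id", "value"]))  # prints [2, 5]
```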
>
> > FWIW, NumPy took this approach: they made a separate repo, and merged it
> into the main repo after it became stable.
>
> As Maciej pointed out:
>
> > As for the POC ‒ we have stubs, which have been maintained for over three
> years now and cover versions from 2.3 (though these are fairly limited) to,
> with some lag, the current master.
>
> What would be required to mark it as stable?
>
> > I guess it all depends on how we envision the future of the annotations
> (including, but not limited to, how conservative we want to be in the
> future), which is probably something that should be discussed here.
>
> I'm happy to motivate people to contribute type hints, and I believe it is
> a very accessible way to get more people involved in the Python codebase.
> Using the ASF model we can ensure that we require committers/PMC to sign
> off on the annotations.
>
> > Indeed, though the possible advantage is that, in theory, you can have a
> different release cycle than the main repo (I am not sure if that's
> feasible in practice, or if that was the intention).
>
> Personally, I don't think we need a different cycle if the type hints are
> part of the code itself.
>
> > If my understanding is correct, pyspark-stubs is still incomplete and
> does not annotate types in some other APIs (by using Any). Correct me if I
> am wrong, Maciej.
>
> For me, it is a bit like code coverage: you want it to be high so that
> most of the APIs are covered, but it will take some time to make it
> complete.
>
> It also feels a bit like a chicken-and-egg problem. Because the type
> hints live in a separate repository, they will always lag behind, and it
> is harder to spot where the gaps are.
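>
> A minimal sketch of such a gap (hypothetical function names, assuming
> mypy as the checker): an API whose types fall back to Any silences the
> checker, while a precisely annotated one does not.

```python
# Hypothetical sketch: an annotated API versus one that falls back to Any.
from typing import Any, List


def annotated_total(xs: List[int]) -> int:
    """Fully annotated: mypy would reject annotated_total(["a", "b"])."""
    return sum(xs)


def unannotated_total(xs: Any) -> Any:
    """Effectively unannotated: mypy accepts any argument, hiding bugs."""
    return sum(xs)


# At runtime both behave the same; only the type checker sees a difference.
print(annotated_total([1, 2, 3]))    # prints 6
print(unannotated_total([1, 2, 3]))  # prints 6
```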
>
> Cheers, Fokko
>
>
>
> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Oh I think I caused some confusion here.
>> Just for clarification, I wasn’t saying we must port this into a separate
>> repo now. I was saying it can be one of the options we can consider.
>>
>> For a bit more context:
>> This option was initially considered, roughly speaking, invalid ‒ it
>> might have needed an incubation process as a separate project.
>> After some investigation, I found that it is still a valid option and
>> that we can take it on as part of Apache Spark, but in a separate repo.
>>
>> FWIW, NumPy took this approach: they made a separate repo
>> <https://github.com/numpy/numpy-stubs>, and merged it into the main repo
>> <https://github.com/numpy/numpy> after it became stable.
>>
>>
>> My only major concerns are:
>>
>>    - the possibility of having to fundamentally change the approach in
>>    pyspark-stubs <https://github.com/zero323/pyspark-stubs> ‒ not because
>>    how it was done is wrong, but because Python type hinting itself keeps
>>    evolving.
>>    - If my understanding is correct, pyspark-stubs
>>    <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>    does not annotate types in some other APIs (by using Any). Correct me if I
>>    am wrong, Maciej.
>>
>> I’ll have a short sync with him and share what we discuss, since he
>> probably knows the PySpark type-hint context best, and I know some of the
>> context in the ASF and Apache Spark.
>>
>>
>>
>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz
>> <mszymkiewicz@gmail.com> wrote:
>>
>>> Indeed, though the possible advantage is that, in theory, you can have a
>>> different release cycle than the main repo (I am not sure if that's
>>> feasible in practice, or if that was the intention).
>>>
>>> I guess it all depends on how we envision the future of the annotations
>>> (including, but not limited to, how conservative we want to be in the
>>> future), which is probably something that should be discussed here.
>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>
>>> So IMO maintaining outside in a separate repo is going to be harder.
>>> That was why I asked.
>>>
>>>
>>>
>>> ------------------------------
>>> *From:* Maciej Szymkiewicz <mszymkiewicz@gmail.com>
>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>> *To:* Sean Owen
>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
>>> Spark Dev List
>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>
>>>
>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>> > Yes, but the general argument you make here is: if you tie this
>>> > project to the main project, it will _have_ to be maintained by
>>> > everyone. That's good, but also exactly the downside I think we want
>>> > to avoid at this stage (I thought?). I understand that for some
>>> > undertakings it's just not feasible to start outside the main
>>> > project, but is there no proof of concept even possible before taking
>>> > this step ‒ which more or less implies it's going to be owned and
>>> > merged and have to be maintained in the main project?
>>>
>>>
>>> I think we have a somewhat different understanding here ‒ I believe we
>>> have reached the conclusion that maintaining annotations within the
>>> project is OK; we only differ on the specific form it should take.
>>>
>>> As for the POC ‒ we have stubs, which have been maintained for over three
>>> years now and cover versions from 2.3 (though these are fairly limited)
>>> to, with some lag, the current master. There is some evidence they are
>>> used in the wild
>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>> there are a few contributors
>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>> subjectively speaking, it seems we're already beyond the POC stage.
>>>
>>> --
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> Keybase: https://keybase.io/zero323
>>> Gigs: https://www.codementor.io/@zero323
>>> PGP: A30CEF0C31A501EC
>>>
>>>
>>>
>>>
