spark-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: [PySpark] Revisiting PySpark type annotations
Date Thu, 20 Aug 2020 11:38:53 GMT
No worries, thanks for the update!

On Thu, 20 Aug 2020 at 12:50, Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Yeah, we had a short meeting. I had to check a few other things, so there
> were some delays. I will share an update soon.
>
> On Thu, 20 Aug 2020 at 7:14 PM, Driesprong, Fokko <fokko@driesprong.frl> wrote:
>
>> Hi Maciej, Hyukjin,
>>
>> Did you find any time to discuss adding the types to the Python
>> repository? Would love to know what came out of it.
>>
>> Cheers, Fokko
>>
>> On Wed, 5 Aug 2020 at 10:14, Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>
>>> Mostly echoing stuff that we've discussed in
>>> https://github.com/apache/spark/pull/29180, but good to have this also
>>> on the dev-list.
>>>
>>> > So IMO maintaining outside in a separate repo is going to be harder.
>>> > That was why I asked.
>>>
>>> I agree with Felix: having this inside the project would make it much
>>> easier to maintain. Having it inside the ASF might make it easier to port
>>> the pyi files to the actual Spark repository.
>>>
>>> > FWIW, NumPy took this approach. They made a separate repo, and merged
>>> > it into the main repo after it became stable.
>>>
>>> As Maciej pointed out:
>>>
>>> > As for the POC ‒ we have stubs, which have been maintained for over three
>>> > years now and cover versions between 2.3 (though these are fairly limited)
>>> > to, with some lag, current master.
>>>
>>> What would be required to mark it as stable?
>>>
>>> > I guess it all depends on how we envision the future of annotations
>>> > (including, but not limited to, how conservative we want to be in the
>>> > future). Which is probably something that should be discussed here.
>>>
>>> I'm happy to motivate people to contribute type hints, and I believe it
>>> is a very accessible way to get more people involved in the Python
>>> codebase. Using the ASF model we can ensure that we require committers/PMC
>>> to sign off on the annotations.
>>>
>>> > Indeed, though the possible advantage is that in theory, you can have
>>> > a different release cycle than for the main repo (I am not sure if that's
>>> > feasible in practice or if that was the intention).
>>>
>>> Personally, I don't think we need a different cycle if the type
>>> hints are part of the code itself.
>>>
>>> > If my understanding is correct, pyspark-stubs is still incomplete and
>>> > does not annotate types in some other APIs (by using Any). Correct me if I
>>> > am wrong, Maciej.
>>>
>>> For me, it is a bit like code coverage. You want this to be high to make
>>> sure that you cover most of the APIs, but it will take some time to make it
>>> complete.
>>>
>>> For me, it feels a bit like a chicken and egg problem. Because the type
>>> hints are in a separate repository, they will always lag behind. Also, it
>>> is harder to spot where the gaps are.
>>>
>>> Cheers, Fokko
>>>
>>>
>>>
>>> On Wed, 5 Aug 2020 at 05:51, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>
>>>> Oh I think I caused some confusion here.
>>>> Just for clarification, I wasn’t saying we must port this into a
>>>> separate repo now. I was saying it can be one of the options we can
>>>> consider.
>>>>
>>>>
>>>> For a bit more context:
>>>> This option was initially considered, roughly speaking, invalid, since
>>>> it might need an incubation process as a separate project.
>>>> After some investigation, I found that it is still a valid option:
>>>> we can keep this as part of Apache Spark but in a separate repo.
>>>>
>>>>
>>>> FWIW, NumPy took this approach. They made a separate repo
>>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main
>>>> repo <https://github.com/numpy/numpy-stubs> after it became stable.
>>>>
>>>> My only major concerns are:
>>>>
>>>>    - The possibility that we will have to fundamentally change the approach
>>>>    in pyspark-stubs <https://github.com/zero323/pyspark-stubs>. This is not
>>>>    because how it was done is wrong, but because of how Python type hinting
>>>>    itself evolves.
>>>>
>>>>    - If my understanding is correct, pyspark-stubs
>>>>    <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>>>    does not annotate types in some other APIs (by using Any). Correct me
>>>>    if I am wrong, Maciej.
>>>>
>>>> I'll have a short sync with him to understand this better and share the
>>>> outcome, since he probably knows the context of PySpark type hints best
>>>> and I know some of the context in the ASF and Apache Spark.
>>>>
>>>> On Wed, 5 Aug 2020 at 6:31 AM, Maciej Szymkiewicz <mszymkiewicz@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Indeed, though the possible advantage is that in theory, you can
>>>>>
>>>>> have different release cycle than for the main repo (I am not sure
>>>>>
>>>>> if that's feasible in practice or if that was the intention).
>>>>>
>>>>>
>>>>> I guess all depends on how we envision the future of annotations
>>>>>
>>>>> (including, but not limited to, how conservative we want to be in
>>>>>
>>>>> the future). Which is probably something that should be discussed
>>>>>
>>>>> here.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>>>
>>>>> So IMO maintaining outside in a separate repo is going
>>>>> to be harder. That was why I asked.
>>>>>
>>>>> ------------------------------
>>>>> *From:* Maciej Szymkiewicz <mszymkiewicz@gmail.com>
>>>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>>>> *To:* Sean Owen
>>>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List
>>>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>>>
>>>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>>>>
>>>>> > Yes, but the general argument you make here is: if you tie this
>>>>> > project to the main project, it will _have_ to be maintained by
>>>>> > everyone. That's good, but also exactly I think the downside we want
>>>>> > to avoid at this stage (I thought?) I understand for some
>>>>> > undertakings, it's just not feasible to start outside the main
>>>>> > project, but is there no proof of concept even possible before taking
>>>>> > this step -- which more or less implies it's going to be owned and
>>>>> > merged and have to be maintained in the main project.
>>>>>
>>>>> I think we have a slightly different understanding here ‒ I believe we
>>>>> have reached a conclusion that maintaining annotations within the project
>>>>> is OK; we only differ when it comes to the specific form it should take.
>>>>>
>>>>> As for the POC ‒ we have stubs, which have been maintained for over three
>>>>> years now and cover versions between 2.3 (though these are fairly limited)
>>>>> to, with some lag, current master. There is some evidence they are used in
>>>>> the wild
>>>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>>>> there are a few contributors
>>>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>>>> subjectively speaking, it seems we're already beyond POC.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Maciej Szymkiewicz
>>>>>
>>>>> Web: https://zero323.net
>>>>> Keybase: https://keybase.io/zero323
>>>>> Gigs: https://www.codementor.io/@zero323
>>>>> PGP: A30CEF0C31A501EC
>>>>>