spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Cutler <cutl...@gmail.com>
Subject Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction
Date Wed, 13 Nov 2019 07:27:07 GMT
Thanks all. I created a WIP PR at https://github.com/apache/spark/pull/26496,
we can further discuss the details in there.

On Thu, Nov 7, 2019 at 7:01 PM Takuya UESHIN <ueshin@happy-camper.st> wrote:

> +1
>
> On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp <sknapp@berkeley.edu> wrote:
>
>> +1
>>
>> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>> >
>> > +1
>> >
>> > 2019년 11월 6일 (수) 오후 11:38, Wenchen Fan <cloud0fan@gmail.com>님이
작성:
>> >>
>> >> Sounds reasonable to me. We should make the behavior consistent within
>> Spark.
>> >>
>> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cutlerb@gmail.com> wrote:
>> >>>
>> >>> Currently, when a PySpark Row is created with keyword arguments, the
>> fields are sorted alphabetically. This has created a lot of confusion with
>> users because it is not obvious (although it is stated in the pydocs) that
>> they will be sorted alphabetically. Then later when applying a schema and
>> the field order does not match, an error will occur. Here is a list of some
>> of the JIRAs that I have been tracking all related to this issue:
>> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
>> of the issue [1].
>> >>>
>> >>> The original reason for sorting fields is because kwargs in python <
>> 3.6 are not guaranteed to be in the same order that they were entered [2].
>> Sorting alphabetically ensures a consistent order. Matters are further
>> complicated with the flag _from_dict_ that allows the Row fields to to be
>> referenced by name when made by kwargs, but this flag is not serialized
>> with the Row and leads to inconsistent behavior. For instance:
>> >>>
>> >>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A
>> string").first()
>> >>> Row(B='2', A='1')
>> >>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1",
>> B="2")]), "B string, A string").first()
>> >>> Row(B='1', A='2')
>> >>>
>> >>> I think the best way to fix this is to remove the sorting of fields
>> when constructing a Row. For users with Python 3.6+, nothing would change
>> because these versions of Python ensure that the kwargs stays in the
>> ordered entered. For users with Python < 3.6, using kwargs would check a
>> conf to either raise an error or fallback to a LegacyRow that sorts the
>> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
>> can also be removed at the same time. There are also other ways to create
>> Rows that will not be affected. I have opened a JIRA [3] to capture this,
>> but I am wondering what others think about fixing this for Spark 3.0?
>> >>>
>> >>> [1] https://github.com/apache/spark/pull/20280
>> >>> [2] https://www.python.org/dev/peps/pep-0468/
>> >>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>

Mime
View raw message