spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction
Date Wed, 06 Nov 2019 14:38:21 GMT
Sounds reasonable to me. We should make the behavior consistent within
Spark.

On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cutlerb@gmail.com> wrote:

> Currently, when a PySpark Row is created with keyword arguments, the
> fields are sorted alphabetically. This has caused a lot of confusion among
> users because it is not obvious (although it is stated in the pydocs) that
> the fields will be sorted. Later, when a schema is applied and the field
> order does not match, an error occurs. Here is a list of some of the JIRAs
> that I have been tracking, all related to this issue: SPARK-24915,
> SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion of the
> issue [1].
>
> The original reason for sorting fields is that kwargs in Python < 3.6
> are not guaranteed to be in the same order that they were entered [2].
> Sorting alphabetically ensures a consistent order. Matters are further
> complicated by the flag __from_dict__, which allows the Row fields to
> be referenced by name when made by kwargs, but this flag is not serialized
> with the Row, which leads to inconsistent behavior. For instance:
>
> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
> Row(B='2', A='1')
> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
> Row(B='1', A='2')
>
> I think the best way to fix this is to remove the sorting of fields when
> constructing a Row. For users with Python 3.6+, nothing would change
> because these versions of Python ensure that kwargs stay in the order
> entered. For users with Python < 3.6, using kwargs would check a
> conf to either raise an error or fall back to a LegacyRow that sorts the
> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
> can also be removed at the same time. There are also other ways to create
> Rows that will not be affected. I have opened a JIRA [3] to capture this,
> but I am wondering what others think about fixing this for Spark 3.0?
>
> [1] https://github.com/apache/spark/pull/20280
> [2] https://www.python.org/dev/peps/pep-0468/
> [3] https://issues.apache.org/jira/browse/SPARK-29748
>
>
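
To make the behavior under discussion concrete, here is a minimal pure-Python
sketch. It does not use PySpark, and every name in it (make_row_sorted,
make_row_ordered, make_row, apply_schema_by_position, fallback_to_sorting) is
hypothetical and chosen for illustration only. It assumes a row can be modeled
as a (field names, values) pair and that a schema is applied to the values
purely by position, which is roughly what the second example in the quoted
message appears to show.

```python
import sys


def make_row_sorted(**fields):
    """Emulate the current behavior: field names are sorted alphabetically."""
    names = sorted(fields)
    return names, tuple(fields[name] for name in names)


def make_row_ordered(**fields):
    """Emulate the proposal: keep the order the kwargs were passed in.

    This relies on Python 3.6+, where kwargs order is preserved (PEP 468).
    """
    return list(fields), tuple(fields.values())


def make_row(*, fallback_to_sorting=True, version=sys.version_info[:2], **fields):
    """Hypothetical dispatcher mirroring the proposed conf check.

    On Python 3.6+ the order is kept. On older versions the caller either
    falls back to the legacy sorted behavior or gets an error.
    """
    if version >= (3, 6):
        return make_row_ordered(**fields)
    if fallback_to_sorting:
        return make_row_sorted(**fields)
    raise RuntimeError("kwargs order is not guaranteed on Python < 3.6")


def apply_schema_by_position(values, schema_names):
    """Pair values with schema field names by position, ignoring row names."""
    return dict(zip(schema_names, values))


# Sorted row: values end up as ('1', '2'), so a "B, A" schema mislabels them.
_, sorted_values = make_row_sorted(B="2", A="1")
print(apply_schema_by_position(sorted_values, ["B", "A"]))   # {'B': '1', 'A': '2'}

# Ordered row: values stay as ('2', '1'), so the same schema lines up.
_, ordered_values = make_row_ordered(B="2", A="1")
print(apply_schema_by_position(ordered_values, ["B", "A"]))  # {'B': '2', 'A': '1'}
```

In this sketch the sorted row reproduces the swapped values from the
parallelize example above, while the order-preserving row matches the schema
as long as the kwargs are written in schema order.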
