Hi,
after upgrading from 2.3.2 to 2.4.0 we noticed a regression when using
posexplode() in combination with selecting fields of another struct column.
Given a DataFrame like this:
=============================
>>> df = (spark.range(1)
... .withColumn("my_arr", array(lit("1"), lit("2")))
... .withColumn("bar", lit("1"))
... .select("id", "my_arr", struct("bar").alias("foo"))
... )
>>>
>>> df.printSchema()
root
|-- id: long (nullable = false)
|-- my_arr: array (nullable = false)
| |-- element: string (containsNull = false)
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = false)
Spark 2.3.2
===========
>>>
>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
|-- pos: integer (nullable = false)
|-- col: string (nullable = false)
|-- bar: string (nullable = false)
Selecting "foo.bar" results in a field named "bar".
Spark 2.4.0
===========
>>>
>>> df = df.select(posexplode("my_arr"), "foo.bar")
>>>
>>> df.printSchema()
root
|-- pos: integer (nullable = false)
|-- col: string (nullable = false)
|-- foo.bar: string (nullable = false)
In 2.4.0 the column is now named 'foo.bar' instead of 'bar', which is not
what we would expect. So existing code using .select("bar") will fail:
>>> df.select("bar").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 1320, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/andreas/Downloads/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`bar`' given input columns: [pos, col, foo.bar];;
'Project ['bar]
+- Project [pos#14, col#15, foo#9.bar AS foo.bar#16]
   +- Generate posexplode(my_arr#2), false, [pos#14, col#15]
      +- Project [id#0L, my_arr#2, named_struct(bar, bar#5) AS foo#9]
         +- Project [id#0L, my_arr#2, 1 AS bar#5]
            +- Project [id#0L, array(1, 2) AS my_arr#2]
               +- Range (0, 1, step=1, splits=Some(4))"
Is this a known issue / intended behavior?
Regards
Andreas