spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
Date Sat, 02 Feb 2019 04:09:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758876#comment-16758876 ]

Hyukjin Kwon commented on SPARK-26810:
--------------------------------------

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
    "but got %s" % (self, len(self), args))
  File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
    return "<Row(%s)>" % ", ".join(self)
TypeError: sequence item 0: expected str instance, list found
{code}

That is another issue; I guess it is SPARK-23299.

Are you sure SPARK-25072 is the cause? I don't see the relevant error messages.
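
For reference, the secondary TypeError in the traceback above is just str.join rejecting a non-string item and can be reproduced without Spark at all. A minimal sketch of the failure mode (not PySpark's actual code path), assuming the Row holds the list ['a', 'b'] as its single field, as the traceback indicates:

{code}
# The __repr__ in the traceback does ", ".join(self), and str.join requires
# every item to be a string, so a single list-valued field breaks it.
fields = (['a', 'b'],)  # Row(['a','b']) appears to hold the list as one field
try:
    print("<Row(%s)>" % ", ".join(fields))
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found
{code}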

> Fixing SPARK-25072 broke existing code and fails to show error message
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26810
>                 URL: https://issues.apache.org/jira/browse/SPARK-26810
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Arttu Voutilainen
>            Priority: Minor
>
> Hey,
> We upgraded Spark recently, and https://issues.apache.org/jira/browse/SPARK-25072 caused our
> pipeline to fail after the upgrade. Annoyingly, the error message formatting also threw an
> exception itself, thus hiding the message we should have seen.
> Repro using gettyimages/docker-spark, on 2.4.0:
> {code}
> from pyspark.sql import Row
> r = Row(['a','b'])
> r('1', '2')
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__
>     "but got %s" % (self, len(self), args))
>   File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__
>     return "<Row(%s)>" % ", ".join(self)
> TypeError: sequence item 0: expected str instance, list found
> {code}
> On 2.3.1 (also showing how this was used):
> {code}
> from pyspark.sql import Row, types as T
> r = Row(['a','b'])
> df = spark.createDataFrame([Row(col='doesntmatter')])
> rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')])
> spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), T.StructField('b', T.StringType())])).collect()
> {code}
> {code}
> [Row(a='a1', b='b2'), Row(a='a1', b='b2')]
> {code}
> While I do think the code we had was quite horrible, it used to work. The unexpected error
> came from __repr__, which assumes that the arguments given to the Row constructor are strings.
> That sounds like a reasonable assumption; maybe the Row constructor should validate that it
> holds? (I guess that might be another potentially breaking change, though, if someone has
> code as weird as this one...)
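
As a rough illustration of the validation suggested above, a row factory could reject non-string field names when it is built, so the failure surfaces at construction time instead of later inside __repr__ while formatting an error message. This is only a hypothetical sketch under that assumption; make_row_factory and its messages are invented for illustration and are not PySpark's implementation:

{code}
# Hypothetical sketch of the suggested validation -- not PySpark code.
def make_row_factory(*field_names):
    # Validate field names up front instead of failing later in __repr__.
    if not all(isinstance(name, str) for name in field_names):
        raise TypeError("field names must be strings, got %r" % (field_names,))

    def factory(*values):
        if len(values) != len(field_names):
            raise ValueError("expected %d values but got %d"
                             % (len(field_names), len(values)))
        return dict(zip(field_names, values))

    return factory

make_row_factory('a', 'b')('1', '2')  # {'a': '1', 'b': '2'}
make_row_factory(['a', 'b'])          # raises TypeError immediately
{code}

For what it is worth, the factory pattern documented for pyspark.sql.Row passes the field names as separate string arguments, e.g. Row('a', 'b'), which sidesteps this case entirely.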



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

