spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-28818) FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present
Date Thu, 29 Aug 2019 01:50:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-28818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hyukjin Kwon updated SPARK-28818:
---------------------------------
    Component/s:     (was: Spark Core)
                 SQL

> FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28818
>                 URL: https://issues.apache.org/jira/browse/SPARK-28818
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Matt Hawes
>            Priority: Minor
>
> This is a trivially reproducible bug in the code for `FrequentItems`. The schema for the resulting arrays of frequent items is hard-coded (FrequentItems.scala#L122) to have non-nullable array elements:
> {code:scala}
> val outputCols = colInfo.map { v =>
>   StructField(v._1 + "_freqItems", ArrayType(v._2, false))
> }
> val schema = StructType(outputCols).toAttributes
> Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, Seq(resultRow)))
> {code}
>  
> However, if the column contains frequent nulls, those nulls are included in the frequent-items array. This leads to various errors; for example, any attempt to `collect()` the result raises a null pointer exception:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a'),
>     (2, None),
>     (3, 'b'),
> ], schema="id INTEGER, val STRING")
> rows = df.freqItems(df.columns).collect()
> {code}
>  Results in:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 533, in collect
>     sock_info = self._jdf.collectToPython()
>   File "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>   File "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
>     return f(*a, **kw)
>   File "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o40.collectToPython.
> : java.lang.NullPointerException
> 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> 	at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
> 	at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> 	at scala.collection.immutable.List.map(List.scala:296)
> 	at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:44)
> 	at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:39)
> 	at org.apache.spark.sql.execution.LocalTableScanExec.executeCollect(LocalTableScanExec.scala:70)
> 	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3257)
> 	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3254)
> 	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
> 	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> 	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> 	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
> 	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3254)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:282)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
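>
> For reference, the schema the returned DataFrame claims can be inspected without triggering the failing projection, and dropping null rows before calling `freqItems` avoids the exception (at the cost of hiding a null that is itself genuinely frequent). The following is only an illustrative sketch, using the Scala API on the same data as the PySpark reproduction above:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().getOrCreate()
> import spark.implicits._
>
> // Same data as the PySpark reproduction above.
> val df = Seq((1, "a"), (2, null), (3, "b")).toDF("id", "val")
>
> // freqItems itself succeeds; only collecting the result fails. The declared
> // schema claims the array elements are non-nullable:
> val freq = df.stat.freqItems(df.columns)
> freq.printSchema()
> // root
> //  |-- id_freqItems: array (nullable = true)
> //  |    |-- element: integer (containsNull = false)
> //  |-- val_freqItems: array (nullable = true)
> //  |    |-- element: string (containsNull = false)
>
> // freq.collect() would hit the NullPointerException shown above. Dropping null
> // rows first avoids it, but hides a genuinely frequent null:
> val deNulled = df.na.drop()
> deNulled.stat.freqItems(deNulled.columns).collect()
> {code}
>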
> It is unclear whether the hard-coding is at fault, or whether the algorithm is actually designed not to return nulls even when they are frequent, in which case the hard-coding would be appropriate. I'll put in a PR that assumes the hard-coding is the bug, unless people know otherwise?
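>
> For illustration, assuming the hard-coding is indeed the bug, the change might be as small as declaring the array elements nullable in the snippet quoted above. This is only a hypothetical sketch, not a merged patch:
> {code:scala}
> // Hypothetical sketch: allow null elements in the frequent-items arrays so that a
> // null frequent item no longer violates the declared schema.
> val outputCols = colInfo.map { v =>
>   StructField(v._1 + "_freqItems", ArrayType(v._2, containsNull = true))
> }
> val schema = StructType(outputCols).toAttributes
> Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, Seq(resultRow)))
> {code}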



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

