spark-issues mailing list archives

From "Omri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf
Date Sun, 08 Apr 2018 19:39:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429864#comment-16429864 ]

Omri commented on SPARK-23883:
------------------------------

Yes it does. Thank you! I missed that part of the documentation.
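
For anyone else who hits this traceback: the part of the documentation in question is presumably the grouped map UDF type. A minimal sketch of a declaration that avoids the error, assuming Spark 2.3's API and the simplified example from the report below:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType
import pandas as pd

schema = StructType([
  StructField("CarId", IntegerType()),
  StructField("Distance", FloatType())
])

# GROUPED_MAP is what permits a StructType return type; the default
# (SCALAR) eval type raises "Invalid returnType with scalar Pandas UDFs".
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    # Placeholder body, mirroring the simplified example from the report
    return pd.DataFrame({"CarId": oneCar["CarId"].iloc[0],
                         "Distance": 3.5}, index=[0])
{code}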

I did find some odd behavior related to the order of the fields in the struct (if you wish, I can open a new issue for it).

When I define the schema like this:
{code:java}
StructType([
  StructField("CarId", IntegerType()),
  StructField("Distance", FloatType())
])
{code}
Spark doesn't use the column names of the DataFrame returned by the pandas_udf; the values are assigned to the struct fields by position, which results in a wrong assignment of types: CarId gets the float value and Distance gets cast to an integer.

Here's the result, for example:
{code:java}
+-----+--------+
|CarId|Distance|
+-----+--------+
|    3|    29.0|
|    3|    65.0|
|    3|   191.0|
|    3|   222.0|
|    3|    19.0|
{code}
The pandas_udf returns 3.5, which gets truncated to 3.

When I reverse the order of the fields in the struct:
{code:java}
schema = StructType([
  StructField("Distance", FloatType()),
  StructField("CarId", IntegerType())
])
{code}
I get this result:
{code:java}
+--------+-----+
|Distance|CarId|
+--------+-----+
|     3.5|   29|
|     3.5|   65|
|     3.5|  191|
|     3.5|  222|
|     3.5|   19|
{code}
I would have expected Spark to match the column names of the returned pandas DataFrame to the StructField names.
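
A workaround sketch (not a fix in Spark itself, and reusing the hypothetical grouped map declaration from above): select the output columns in schema order before returning, so the positional assignment lines up with the declared field types:
{code:java}
# Return the columns in schema order so that Spark's positional
# field assignment lines up with the declared types.
field_names = [f.name for f in schema.fields]

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    out = pd.DataFrame({"CarId": oneCar["CarId"].iloc[0],
                        "Distance": 3.5}, index=[0])
    return out[field_names]  # enforce the schema's column order
{code}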

 

Thanks again

> Error with conversion to arrow while using pandas_udf
> -----------------------------------------------------
>
>                 Key: SPARK-23883
>                 URL: https://issues.apache.org/jira/browse/SPARK-23883
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>            Reporter: Omri
>            Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
>     #Calculate distance, return scalar
>     return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
>     dist = haversine(oneCar.Longitude.shift(1),
>                      oneCar.Latitude.shift(1),
>                      oneCar.loc[1:, 'Longitude'], 
>                      oneCar.loc[1:, 'Latitude'])
>     return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0], "Distance": np.sum(dist)}, index=[0])
> ## Calculate the overall distance made by each car
> distancePerCar = df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception about Arrow not being able to handle this return type:
> {noformat}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
>     114             try:
> --> 115                 to_arrow_type(self._returnType_placeholder)
>     116             except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
>    1641     else:
> -> 1642         raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
>    1643     return arrow_type
> TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-35-4f2194cfb998> in <module>()
>      18     km = 6367 * c
>      19     return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>      21 def totalDistance(oneUser):
>      22     dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
>      62     udf_obj = UserDefinedFunction(
>      63         f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
> ---> 64     return udf_obj._wrapped()
>      65 
>      66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
>     184 
>     185         wrapper.func = self.func
> --> 186         wrapper.returnType = self.returnType
>     187         wrapper.evalType = self.evalType
>     188         wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
>     117                 raise NotImplementedError(
>     118                     "Invalid returnType with scalar Pandas UDFs: %s is "
> --> 119                     "not supported" % str(self._returnType_placeholder))
>     120         elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
>     121             if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("<CarId:int,Distance:float>") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>  
> As mentioned, this works on a Databricks instance in Azure, but not locally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

