spark-issues mailing list archives

From "Omri (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name
Date Tue, 10 Apr 2018 14:23:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432344#comment-16432344 ]

Omri edited comment on SPARK-23929 at 4/10/18 2:22 PM:
-------------------------------------------------------

[~icexelloss], I couldn't reproduce my original problem, where the column order was mixed
up, but here is a different example that illustrates the issue.

Here the schema struct is [id, zeros, ones], but the user returns a data frame with columns
[id, ones, zeros]. The output column names are taken from the provided schema by position,
and the column names given explicitly in the returned data frame are ignored.

 
{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell).
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# Declared output schema, in the order id, zeros, ones.
schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def constants(grp):
    # The returned columns are named id, ones, zeros.
    return pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1, "zeros": 0}, index=[0])

df.groupby("id").apply(constants).show()
{code}
Results:
{code:java}
+---+-----+----+
| id|zeros|ones|
+---+-----+----+
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-----+----+
{code}
The user expects ones to contain 1 and zeros to contain 0, but the values come back swapped.
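
Until the mapping is done by name, one possible workaround is to reorder the returned columns to match the schema before returning from the UDF. The sketch below uses pandas only, so it runs without Spark. The helper name align_to_schema and the SCHEMA_NAMES list are hypothetical and not part of PySpark.

```python
import pandas as pd

# Column order declared in the schema from the example above.
SCHEMA_NAMES = ["id", "zeros", "ones"]


def align_to_schema(pdf, names):
    """Hypothetical helper: select columns by name, in schema order."""
    missing = [n for n in names if n not in pdf.columns]
    if missing:
        raise KeyError("missing columns: %s" % missing)
    return pdf[list(names)]


# The frame the UDF returns for the group with id == 1, built with an
# explicit column order of id, ones, zeros.
returned = pd.DataFrame({"id": 1, "ones": 1, "zeros": 0}, index=[0],
                        columns=["id", "ones", "zeros"])

aligned = align_to_schema(returned, SCHEMA_NAMES)
print(list(aligned.columns))     # column names now follow the schema order
print(aligned.iloc[0].tolist())  # values travel with their names
```

Inside the UDF, the list of names could presumably be derived from the schema itself, for example with [f.name for f in schema.fields], so that the order is stated in one place only.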

 



> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>  
>            Reporter: Omri
>            Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by name. Currently
> this is not the case.
> Consider these two examples, where the only change is the order of the fields in the
> provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}
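
The numbers in the first result are consistent with a purely positional mapping followed by a cast to the declared types. The pandas-only sketch below reproduces them under that assumption. It illustrates the arithmetic and does not show Spark's actual code path; the variable names are made up for the example.

```python
import pandas as pd

# Same input data as in the issue, with columns in the order id, v.
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                    "v": [1.0, 2.0, 3.0, 5.0, 10.0]},
                   columns=["id", "v"])

# Per-group normalization, equivalent to what normalize() computes.
grouped_v = pdf.groupby("id")["v"]
normalized = pdf.assign(
    v=(pdf["v"] - grouped_v.transform("mean")) / grouped_v.transform("std"))

# Assumption: the schema "v double, id long" is applied by position, so the
# first returned column (id) is labeled v and the second (normalized v) is
# labeled id and cast to a 64-bit integer, which truncates toward zero.
by_position = pd.DataFrame({
    "v": normalized.iloc[:, 0].astype("float64"),
    "id": normalized.iloc[:, 1].astype("int64"),
}, columns=["v", "id"])

print(by_position)
```

Under this reading, the v column of the first result holds the original ids as doubles, and the id column holds the normalized values truncated to integers.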



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
