spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Omri (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-23929) pandas_udf schema mapped by position and not by name
Date Mon, 09 Apr 2018 06:53:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Omri updated SPARK-23929:
-------------------------
    Description: 
The return struct of a pandas_udf should be mapped to the provided schema by name. Currently
it's not the case.

Consider these two examples, where the only change is the order of the fields in the provided
schema struct:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  
@pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).show() 
{code}
and this one:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  
@pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).show()
{code}
The results should be the same but they are different:

For the first code:
{code:java}
+---+---+
|  v| id|
+---+---+
|1.0|  0|
|1.0|  0|
|2.0|  0|
|2.0|  0|
|2.0|  1|
+---+---+
{code}
For the second code:
{code:java}
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+


{code}

  was:
The return struct of a pandas_udf should be mapped to the provided schema by name. Currently
it's not the case. Consider these two examples, where the only change is the order of the
fields in the provided schema struct:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  
@pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).show() 
{code}
and this one:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))  
@pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).show()
{code}
The results should be the same but they are different:

For the first code:
{code:java}
+---+---+
|  v| id|
+---+---+
|1.0|  0|
|1.0|  0|
|2.0|  0|
|2.0|  0|
|2.0|  1|
+---+---+
{code}
For the second code:
{code:java}
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+


{code}


> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>  
>            Reporter: Omri
>            Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by name. Currently
it's not the case.
> Consider these two examples, where the only change is the order of the fields in the
provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message