spark-issues mailing list archives

From Andrés Doncel Ramírez (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SPARK-26869) UDF with struct requires to have _1 and _2 as struct field names
Date Wed, 13 Feb 2019 12:06:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrés Doncel Ramírez updated SPARK-26869:
------------------------------------------
    Description: 
When a UDF takes a Seq of tuples as input, the struct field names of the input column must be
"_1" and "_2". The following code illustrates this:

 
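{code:scala}
// In spark-shell, `sc` and the implicits needed for toDF are already in scope.
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}

val df = sc.parallelize(Seq(
  ("1", 3.0),
  ("2", 4.5),
  ("5", 2.0)
)).toDF("c1", "c2")

// Struct fields keep the source column names c1 and c2
val df1 = df.agg(collect_list(struct("c1", "c2")).as("c3"))
// Changing column names to _1 and _2 when creating the struct
val df2 = df.agg(collect_list(struct(col("c1").as("_1"), col("c2").as("_2"))).as("c3"))

// UDF that takes a Seq of tuples and returns its first two elements
val takeUDF = udf { (xs: Seq[(String, Double)]) =>
  xs.take(2)
}

df1.printSchema()
df2.printSchema()

df1.withColumn("c4", takeUDF(col("c3"))).show() // this fails

df2.withColumn("c4", takeUDF(col("c3"))).show() // this works
{code}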
The first call (on df1) throws the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(c3)' due to data type mismatch:
argument 1 requires array<struct<_1:string,_2:double>> type, however, '`c3`' is
of array<struct<c1:string,c2:double>> type.;;

The second call (on df2) works as expected and prints the result.
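
The following is a sketch of two possible workarounds. It is not part of the original report and has not been verified against the affected versions (2.3.0, 2.4.0). It first prints the field names Spark derives for a Scala tuple, which is presumably where the expected type in the exception comes from. It then either casts the column so the struct fields are renamed by position, or declares the UDF input as Seq[Row] and reads the fields by index. The object name, the application name, and the local session are assumptions made only to keep the example self-contained.

```scala
import org.apache.spark.sql.{Encoders, Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}

// Hypothetical object name, used only for this illustration.
object StructFieldNameWorkarounds {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, use the existing `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-26869-sketch")
      .getOrCreate()
    import spark.implicits._

    // A Scala tuple is encoded as a struct whose fields are named _1 and _2.
    println(Encoders.product[(String, Double)].schema.fieldNames.mkString(","))

    val df = Seq(("1", 3.0), ("2", 4.5), ("5", 2.0)).toDF("c1", "c2")
    // Struct fields here are named c1 and c2, as in the failing case.
    val df1 = df.agg(collect_list(struct("c1", "c2")).as("c3"))

    val takeUDF = udf { (xs: Seq[(String, Double)]) => xs.take(2) }

    // Option 1 (assumption: a struct-to-struct cast renames fields by position):
    // cast c3 to the type the UDF expects before passing it in.
    val viaCast = df1.withColumn(
      "c4", takeUDF(col("c3").cast("array<struct<_1:string,_2:double>>")))
    viaCast.show(false)

    // Option 2: accept generic rows, so the field names are not part of the
    // declared input type, and read the fields by position.
    val takeRowsUDF = udf { (xs: Seq[Row]) =>
      xs.take(2).map(r => (r.getString(0), r.getDouble(1)))
    }
    val viaRows = df1.withColumn("c4", takeRowsUDF(col("c3")))
    viaRows.show(false)

    spark.stop()
  }
}
```

Option 1 keeps the tuple-typed UDF unchanged, at the cost of repeating the type string at each call site. Option 2 gives up the tuple type in the signature, which also removes the input type check.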

> UDF with struct requires to have _1 and _2 as struct field names
> ----------------------------------------------------------------
>
>                 Key: SPARK-26869
>                 URL: https://issues.apache.org/jira/browse/SPARK-26869
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.0
>         Environment: Ubuntu 18.04.1 LTS
>            Reporter: Andrés Doncel Ramírez
>            Priority: Minor

