spark-dev mailing list archives

From 刘崇光 <lcg31...@gmail.com>
Subject Fwd: array_contains in package org.apache.spark.sql.functions
Date Thu, 14 Jun 2018 09:15:05 GMT
---------- Forwarded message ----------
From: 刘崇光 <lcg31439@gmail.com>
Date: Thu, Jun 14, 2018 at 11:08 AM
Subject: array_contains in package org.apache.spark.sql.functions
To: user@spark.apache.org


Hello all,

I ran into a use case in a project with Spark SQL and would like to share
some thoughts about the function array_contains.

Say I have a DataFrame containing two columns: column A of type "Array of
String" and column B of type "String". I want to determine whether the value
of column B is contained in the value of column A, without using a UDF of
course.
The function array_contains naturally came to mind:

def array_contains(column: Column, value: Any): Column = withExpr {
  ArrayContains(column.expr, Literal(value))
}

However, the function wraps its second argument in a Literal. Passing
column B therefore tries to build a literal from a Column object, which
yields a runtime exception: RuntimeException("Unsupported literal
type " + v.getClass + " " + v).

Then, after discussing with my friends, we found a solution that does not use a UDF:

new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))


Based on this solution, I am thinking of making the function a little more
powerful, like this:

def array_contains(column: Column, value: Any): Column = withExpr {
  value match {
    case c: Column => ArrayContains(column.expr, c.expr)
    case _ => ArrayContains(column.expr, Literal(value))
  }
}


It pattern-matches on value to detect whether it is a Column. If so, it
uses the column's .expr; otherwise it behaves as before.
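
To make the dispatch concrete without needing a Spark installation, here is a
minimal sketch of the same idea on toy types. Everything in it (Expr, Lit,
ColRef, Contains, Col, literal, arrayContainsOriginal, arrayContainsProposed)
is a hypothetical stand-in invented for illustration. These are not Spark's
classes, and the toy literal function only approximates which values Spark
accepts as literals.

```scala
object ArrayContainsSketch {
  // Toy stand-ins for Catalyst expressions (hypothetical, not Spark's classes).
  sealed trait Expr
  final case class Lit(value: Any) extends Expr
  final case class ColRef(name: String) extends Expr
  final case class Contains(array: Expr, element: Expr) extends Expr

  // Toy stand-in for a Column: a thin wrapper around an expression.
  final case class Col(expr: Expr)

  // Assumption: only a few plain value types can be turned into a literal;
  // anything else is rejected with the error message quoted in the email.
  def literal(v: Any): Expr = v match {
    case null => Lit(null)
    case _: String | _: Int | _: Long | _: Double | _: Boolean => Lit(v)
    case other =>
      throw new RuntimeException("Unsupported literal type " + other.getClass + " " + other)
  }

  // Current behaviour: the second argument is always wrapped in a literal.
  def arrayContainsOriginal(column: Col, value: Any): Col =
    Col(Contains(column.expr, literal(value)))

  // Proposed behaviour: a Col argument contributes its expression directly,
  // any other value still goes through the literal path.
  def arrayContainsProposed(column: Col, value: Any): Col = value match {
    case c: Col => Col(Contains(column.expr, c.expr))
    case _      => Col(Contains(column.expr, literal(value)))
  }
}
```

In this sketch, passing a Col as the second argument to arrayContainsOriginal
throws the "Unsupported literal type" exception, while arrayContainsProposed
builds a Contains node over the two column expressions. For plain values such
as a String, both versions produce the same result.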

Any suggestions or opinions on this proposal?


Kind regards,
Chongguang LIU
