spark-dev mailing list archives

From Felix Cheung
Subject Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
Date Fri, 01 Sep 2017 18:07:48 GMT
+1 on this, and I like the suggestion of accepting the type in string form.

Would it be correct to assume there will be a data type check, for example that the column
data types of the returned pandas data frame match what is specified? We have seen quite a few
issues and a lot of confusion with that in R.

Would it make sense to have a more generic decorator name so that it could also be usable
for other efficient vectorized formats in the future? Or do we anticipate the decorator to
be format-specific, with more decorators added in the future?

From: Reynold Xin
Sent: Friday, September 1, 2017 5:16:11 AM
To: Takuya UESHIN
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Ok, thanks.

+1 on the SPIP for scope etc.

On API details (will deal with these in code reviews as well, but leaving a note here in case I forget):

1. I would suggest having the API also accept data type specification in string form. It is
usually simpler to say "long" than "LongType()".

2. Think about what error message to show when the row counts don't match at runtime.
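
To make the two notes above concrete, here is a minimal sketch in plain Python. It is not the PySpark API: `vectorized_udf`, `ReturnType`, `resolve_type`, and the type-name table are all hypothetical stand-ins, and plain lists play the role of pandas.Series. It only illustrates accepting "long" as well as a type object, and raising a descriptive error when the returned length differs from the input length.

```python
import functools

# Hypothetical mapping from short type strings to type names (an assumption,
# not the actual list of strings Spark accepts).
_TYPE_NAMES = {"long": "LongType", "double": "DoubleType", "string": "StringType"}


class ReturnType:
    """Hypothetical stand-in for type objects such as LongType()."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"{self.name}()"


def resolve_type(spec):
    """Accept either a ReturnType object or a short string such as "long"."""
    if isinstance(spec, ReturnType):
        return spec
    if isinstance(spec, str):
        key = spec.strip().lower()
        if key not in _TYPE_NAMES:
            raise ValueError(
                f"unknown type string {spec!r}; expected one of {sorted(_TYPE_NAMES)}"
            )
        return ReturnType(_TYPE_NAMES[key])
    raise TypeError(f"return type must be a string or ReturnType, got {type(spec).__name__}")


def vectorized_udf(return_type):
    """Hypothetical decorator: records the return type and checks output length."""
    resolved = resolve_type(return_type)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*columns):
            expected = len(columns[0])
            result = func(*columns)
            if len(result) != expected:
                raise ValueError(
                    f"{func.__name__} returned {len(result)} values for a batch of "
                    f"{expected} rows; a vectorized UDF must return one value per input row"
                )
            return result

        wrapper.return_type = resolved
        return wrapper

    return decorator


@vectorized_udf("long")  # string form
def plus_one(v):
    return [x + 1 for x in v]


@vectorized_udf(ReturnType("LongType"))  # object form
def drop_last(v):
    return v[:-1]  # deliberately returns too few values
```

Calling `plus_one([1, 2, 3])` passes the length check, while `drop_last([1, 2, 3])` raises a `ValueError` that names the function and both counts, which is roughly the kind of message point 2 asks for.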

On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN wrote:
Yes, aggregation is out of scope for now.
I think we should continue discussing aggregation on JIRA, and we will add it
later separately.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin wrote:
Is the idea that aggregation is out of scope for the current effort and that we will add it later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN wrote:
Hi all,

We've been discussing support for vectorized UDFs in Python and have almost reached consensus on
the APIs, so I'd like to summarize and call for a vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs
or Window operations.

Proposed API

We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs. A vectorized UDF
takes one or more pandas.Series or, for 0-parameter UDFs, a single integer value giving the length
of the input. The return value should be a pandas.Series of the specified type, and its length
should be the same as that of the input.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

or as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similarly to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use them as:

  @pandas_udf(LongType())
  def f0(size):
      return pd.Series(1).repeat(size)

  df.select(f0())
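
As an illustration of the contract described above, the following sketch calls the function bodies directly on pandas.Series batches, without Spark and without the decorator. The batch values and the variable names `batch_v1`, `batch_v2`, `summed`, and `ones` are made up for the example; how Spark actually splits a column into batches is not specified here.

```python
import pandas as pd


def plus(v1, v2):
    # Each argument is a whole pandas.Series (one batch of rows), not a scalar,
    # so the addition is applied to the entire batch at once.
    return v1 + v2


def f0(size):
    # 0-parameter case: the only input is the batch length, and the result
    # must contain exactly that many values.
    return pd.Series(1).repeat(size)


# Hypothetical batches standing in for slices of df.v1 and df.v2.
batch_v1 = pd.Series([1.0, 2.0, 3.0])
batch_v2 = pd.Series([10.0, 20.0, 30.0])

summed = plus(batch_v1, batch_v2)
ones = f0(len(batch_v1))

# Both results have one value per input row, as the proposal requires.
assert len(summed) == len(batch_v1)
assert len(ones) == len(batch_v1)
```

The point of the design is that the Python function is invoked once per batch rather than once per row, so the per-call overhead is amortized over many rows.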

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.


Tokyo, Japan
