Wenchen,

There are a few issues with the Invoke approach, and I don’t think that it is really much better for the additional complexity of the API.

First I think that you’re confusing codegen to call a function with codegen to implement a function. The non-goal refers to supporting codegen to implement a UDF. That’s what could have differences between the called version and generated version. But the Invoke option isn’t any different in that case because Invoke codegen is only used to call a method, and we can codegen int result = udfName(x, y) just like we can codegen int result = udfName(row).

The Invoke approach also has a problem with expressiveness. Consider a map function that builds a map from its inputs as key/value pairs: map("x", r * cos(theta), "y", r * sin(theta)). If this requires a defined Java function, then there must be lots of implementations for different numbers of pairs, for different types, etc:

public MapData buildMap(String k1, int v1);
...
public MapData buildMap(String k1, long v1);
...
public MapData buildMap(String k1, float v1);
...
public MapData buildMap(String k1, double v1);
public MapData buildMap(String k1, double v1, String k2, double v2);
public MapData buildMap(String k1, double v1, String k2, double v2, String k3, double v3);
...

Clearly, this and many other use cases would fall back to varargs instead. In that case, there is little benefit to using invoke because all of the arguments will get collected into an Object[] anyway. That’s basically the same thing as using a row object, just without convenience functions that return specific types like getString, forcing implementations to cast instead. And, the Invoke approach has a performance penalty when existing rows could be simply projected using a wrapper.

There’s also a massive performance penalty for the Invoke approach when falling back to non-codegen because the function is loaded and invoked each time eval is called. It is much cheaper to use a method in an interface.

Next, the Invoke approach is much more complicated for implementers to use. Should they use String or UTF8String? What representations are supported and how will Spark detect and produce those representations? What if a function uses both String and UTF8String? Will Spark detect this for each parameter? Having one or two functions called by Spark is much easier to maintain in Spark and avoid a lot of debugging headaches when something goes wrong.

On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

This is a very important feature, thanks for working on it!

Spark uses codegen by default, and it's a bit unfortunate to see that codegen support is treated as a non-goal. I think it's better to not ask the UDF implementations to provide two different code paths for interpreted evaluation and codegen evaluation. The Expression API does so and it's very painful. Many bugs were found due to inconsistencies between the interpreted and codegen code paths.

Now, Spark has the infra to call arbitrary Java functions in both interpreted and codegen code paths, see StaticInvoke and Invoke. I think we are able to define the UDF API in the most efficient way. For example, a UDF that takes long and string, and returns int:

class MyUDF implements ... {
  int call(long arg1, UTF8String arg2) { ... }
}

There is no compile-time type-safety. But there is also no boxing, no extra InternalRow building, no separated interpreted and codegen code paths. The UDF will report input data types and result data type, so the analyzer can check if the call method is valid via reflection, and we still have query-compile-time type safety. It also simplifies development as we can just use the Invoke expression to invoke UDFs.

On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue <blue@apache.org> wrote:
Hi everyone,

I'd like to start a discussion for adding a FunctionCatalog interface to catalog plugins. This will allow catalogs to expose functions to Spark, similar to how the TableCatalog interface allows a catalog to expose tables. The proposal doc is available here: https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit

Here's a high-level summary of some of the main design choices:
* Adds the ability to list and load functions, not to create or modify them in an external catalog
* Supports scalar, aggregate, and partial aggregate functions
* Uses load and bind steps for better error messages and simpler implementations
* Like the DSv2 table read and write APIs, it uses InternalRow to pass data
* Can be extended using mix-in interfaces to add vectorization, codegen, and other future features

There is also a PR with the proposed API: https://github.com/apache/spark/pull/24559/files

Let's discuss the proposal here rather than on that PR, to get better visibility. Also, please take the time to read the proposal first. That really helps clear up misconceptions.



--
Ryan Blue

--
Ryan Blue