spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Tue, 09 Feb 2021 10:36:50 GMT
Hi Ryan,

Sorry if I didn't make it clear. I was referring to implementing UDF using
codegen, not calling the UDF with codegen or not. Calling UDF is Spark's
job and it doesn't matter if the UDF API uses row or individual parameters,
as you said. My point is, it's a bad idea to follow the Expression API to
separate the interpreted and codegen code paths in the UDF API, as it adds
implementation complexity and is error-prone (people need to keep
interpreted and codegen in sync).

You made a good point that the Invoke approach can lead to methods
expansion. It's a common problem in Java if you want to specialize the
generic type, and Scala can mitigate this a bit using syntax sugar (type
specification
<https://stackoverflow.com/questions/13173181/specialization-of-generic-functions-in-scala-or-java>).
Varargs is a problem and in that case I agree the row parameter is
better. The current Scala/Java UDF doesn't support varargs and I'm not sure
how common it is. The UDF can take struct type inputs, which kind of
supports varargs as people can do SELECT my_udf(struct(1, 'abc', ...)).

> And, the Invoke approach has a performance penalty when existing rows
could be simply projected using a wrapper.

This only happens in the interpreted code path. When the query falls back
to interpreted execution, it'll be very slow and a little performance
penalty of UDF doesn't really matter.

> There’s also a massive performance penalty for the Invoke approach when
falling back to non-codegen because the function is loaded and invoked each
time eval is called. It is much cheaper to use a method in an interface.

Can you elaborate? Using the row parameter or individual parameters
shouldn't change the life cycle of the UDF instance.

> Should they use String or UTF8String? What representations are supported
and how will Spark detect and produce those representations?

It's the same as InternalRow. We can just copy-paste the document of
InternalRow to explain the corresponding java type for each data type.

On Tue, Feb 9, 2021 at 6:28 AM Ryan Blue <blue@apache.org> wrote:

> Wenchen,
>
> There are a few issues with the Invoke approach, and I don’t think that
> it is really much better for the additional complexity of the API.
>
> First I think that you’re confusing codegen to call a function with
> codegen to implement a function. The non-goal refers to supporting codegen
> to *implement* a UDF. That’s what could have differences between the
> called version and generated version. But the Invoke option isn’t any
> different in that case because Invoke codegen is only used to call a
> method, and we can codegen int result = udfName(x, y) just like we can
> codegen int result = udfName(row).
>
> The Invoke approach also has a problem with expressiveness. Consider a map
> function that builds a map from its inputs as key/value pairs: map("x", r
> * cos(theta), "y", r * sin(theta)). If this requires a defined Java
> function, then there must be lots of implementations for different numbers
> of pairs, for different types, etc:
>
> public MapData buildMap(String k1, int v1);
> ...
> public MapData buildMap(String k1, long v1);
> ...
> public MapData buildMap(String k1, float v1);
> ...
> public MapData buildMap(String k1, double v1);
> public MapData buildMap(String k1, double v1, String k2, double v2);
> public MapData buildMap(String k1, double v1, String k2, double v2, String k3, double
v3);
> ...
>
> Clearly, this and many other use cases would fall back to varargs instead.
> In that case, there is little benefit to using invoke because all of the
> arguments will get collected into an Object[] anyway. That’s basically
> the same thing as using a row object, just without convenience functions
> that return specific types like getString, forcing implementations to
> cast instead. And, the Invoke approach has a performance *penalty* when
> existing rows could be simply projected using a wrapper.
>
> There’s also a massive performance penalty for the Invoke approach when
> falling back to non-codegen because the function is loaded and invoked each
> time eval is called. It is much cheaper to use a method in an interface.
>
> Next, the Invoke approach is much more complicated for implementers to
> use. Should they use String or UTF8String? What representations are
> supported and how will Spark detect and produce those representations? What
> if a function uses both String *and* UTF8String? Will Spark detect this
> for each parameter? Having one or two functions called by Spark is much
> easier to maintain in Spark and avoid a lot of debugging headaches when
> something goes wrong.
>
> On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
> This is a very important feature, thanks for working on it!
>>
>> Spark uses codegen by default, and it's a bit unfortunate to see that
>> codegen support is treated as a non-goal. I think it's better to not ask
>> the UDF implementations to provide two different code paths for interpreted
>> evaluation and codegen evaluation. The Expression API does so and it's very
>> painful. Many bugs were found due to inconsistencies between
>> the interpreted and codegen code paths.
>>
>> Now, Spark has the infra to call arbitrary Java functions in
>> both interpreted and codegen code paths, see StaticInvoke and Invoke. I
>> think we are able to define the UDF API in the most efficient way.
>> For example, a UDF that takes long and string, and returns int:
>>
>> class MyUDF implements ... {
>>   int call(long arg1, UTF8String arg2) { ... }
>> }
>>
>> There is no compile-time type-safety. But there is also no boxing, no
>> extra InternalRow building, no separated interpreted and codegen code
>> paths. The UDF will report input data types and result data type, so the
>> analyzer can check if the call method is valid via reflection, and we
>> still have query-compile-time type safety. It also simplifies development
>> as we can just use the Invoke expression to invoke UDFs.
>>
>> On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue <blue@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to start a discussion for adding a FunctionCatalog interface to
>>> catalog plugins. This will allow catalogs to expose functions to Spark,
>>> similar to how the TableCatalog interface allows a catalog to expose
>>> tables. The proposal doc is available here:
>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>
>>> Here's a high-level summary of some of the main design choices:
>>> * Adds the ability to list and load functions, not to create or modify
>>> them in an external catalog
>>> * Supports scalar, aggregate, and partial aggregate functions
>>> * Uses load and bind steps for better error messages and simpler
>>> implementations
>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>> data
>>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>>> and other future features
>>>
>>> There is also a PR with the proposed API:
>>> https://github.com/apache/spark/pull/24559/files
>>>
>>> Let's discuss the proposal here rather than on that PR, to get better
>>> visibility. Also, please take the time to read the proposal first. That
>>> really helps clear up misconceptions.
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>> --
> Ryan Blue
>

Mime
View raw message