spark-dev mailing list archives

From Chao Sun <sunc...@apache.org>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Fri, 26 Feb 2021 23:37:57 GMT
Correct me if I'm wrong, but it appears we've basically agreed upon the
APIs proposed in the SPIP (setting aside the naming part):

interface ScalarFunction<R> extends BoundFunction<R> {
  R produceResult(InternalRow args);
}
interface AggregateFunction<S, R> extends BoundFunction<R> {
  S update(S state, InternalRow input);
}

together with the rest of the design such as FunctionCatalog and binding
process.
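
For concreteness, a trivial implementation against these interfaces could
look like the following (a hypothetical sketch for illustration only;
InternalRow is org.apache.spark.sql.catalyst.InternalRow, and the other
BoundFunction methods are omitted):

class IntAdd implements ScalarFunction<Integer> {
  @Override
  public Integer produceResult(InternalRow args) {
    // args holds the bound input columns in Spark's internal representation
    return args.getInt(0) + args.getInt(1);
  }
}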

The argument at the moment seems to be whether we want to have
SupportsInvoke or [Scalar|Aggregate]FunctionN alongside these, is that
correct?
In order to move this forward, perhaps we can *merge the PR as it is* (maybe
we'll need a vote?) and proceed to discuss these topics. We can then also
present separate PRs on top of it, which would make it much easier for people
in this thread to comment.

WDYT?

Best,
Chao

On Wed, Feb 24, 2021 at 10:45 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> I think there is one agreement between us: we need both the
> individual-parameters and row-parameter APIs (your SupportsInvoke proposal
> and my VarargsScalarFunction proposal). IIUC the argument now is how to
> compose these 2 APIs.
>
> Your proposal is to put the row-parameter API in the base ScalarFunction
> interface, with an optional SupportsInvoke interface for the
> individual-parameters API. I don't like it because it promotes the
> row-parameter API and forces users to implement it, even if the users want
> to only use the individual-parameters API.
>
> My proposal is to leave the choice to the users. They can pick one from
> ScalarFunction0, ScalarFunction1, ..., VarargsScalarFunction.
>
> More replies below:
>
> > We agree that ScalarFunction and AggregateFunction can optionally define
> methods for Spark to directly call in codegen
>
> I don't think we agree with it. Whatever UDF API we choose at the end
> (either individual-parameters or row-parameter), both non-codegen and
> codegen code paths should just call these Java methods from the UDF API. It
> doesn't make sense to have different UDF APIs for non-codegen and codegen.
>
> > The second option is to introduce 9 or more interfaces to break out the
> fields of the input row, and an additional Object[] variation for varargs:
>
> My initial idea is to not have these 9 interfaces and fully rely on Java
> reflection. We can do some benchmarks; if reflection is not that slow, I
> think we don't need to add these 9 interfaces. The Presto UDF API takes the
> same approach. And one correction: my proposal is to use InternalRow for
> varargs UDFs, not Object[].
>
> > Spark will need additional code to call the right method based on input,
> so it will either have 10 wrapper classes or a big match statement
>
> You can take a look at the Spark ScalaUDF
> <https://github.com/apache/spark/blob/v3.1.1-rc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala>
> expression. It has a big match statement for the non-codegen path, but the
> codegen path is much simpler because we can generate the exact Java code to
> call the specific UDF. I don't think it's a big problem, or we can use
> reflection in the non-codegen path to avoid the big match statement.
>
> > Spark is always going to essentially call the raw interface with no
> specific type parameters. As a result, incorrect types (like String) will
> compile but fail at runtime with ClassCastException.
>
> You seem to keep ignoring my proposal that we can check the UDF function
> signature at the analysis phase to make sure it matches the input types.
> And with codegen Spark can call the specific function to avoid boxing
> issues. If you missed my previous example, here is what the generated code
> looks like:
>
> double input1 = ...;
> double input2 = ...;
> DoubleAdd udf = ...;
> double res = udf.call(input1, input2);
>
> > I think that the InternalRow option is easier to build against because
> it provides at least some type checking when accessing values from the
> input row.
>
> I have a different opinion about this. If the input is string type but the
> UDF implementation calls `row.getLong(0)`, *it returns wrong data*, which
> is very bad. With the individual-parameters approach, if you implement the
> UDF with `def call(input: Long)` but the input is string type, the analyzer
> can detect it and fail the query.
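>
> To make the contrast concrete, here is a hypothetical sketch (illustration
> only, using the proposed interfaces; other methods omitted):
>
> // Row-parameter style: if column 0 is actually a string column, getLong(0)
> // silently reads the wrong bytes; there is no analysis-time error.
> class PlusOne implements ScalarFunction<Long> {
>   public Long produceResult(InternalRow row) {
>     return row.getLong(0) + 1;  // wrong data if column 0 is not a bigint
>   }
> }
>
> // Individual-parameters style: the parameter type is part of the method
> // signature, so the analyzer can reject a string input up front.
> class PlusOne2 implements ScalarFunction1<Long, Long> {
>   public Long produceResult(Long value) {
>     return value + 1;
>   }
> }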
>
> On Thu, Feb 25, 2021 at 6:48 AM Ryan Blue <rblue@netflix.com> wrote:
>
>> How functions are called is a really big element of this effort. I don’t
>> want to get in a position where we’ve started committing changes without
>> clear agreement on something so fundamental to the proposal. I’d like to
>> make sure we’re in agreement with a vote on the SPIP before committing
>> anything. That is, after all, the point of the SPIPs.
>>
>> If people think it would help to have an alternative API in a PR, then
>> that’s fine with me.
>>
>> Since that PR suggestion is intended to make it easier to understand the
>> technical details, I’ll try to summarize where we’re at now:
>>
>>    - We agree on the scope of adding FunctionCatalog to load functions
>>    - We agree with the FunctionCatalog methods and the function binding
>>    approach
>>    - We agree that a bound function will be either a ScalarFunction or
>>    an AggregateFunction (plus the mix-in PartialAggregateFunction)
>>    - We agree that values should be passed using Spark’s internal
>>    representation to avoid translation
>>    - We agree that ScalarFunction and AggregateFunction can optionally
>>    define methods for Spark to directly call in codegen
>>
>> The disagreement is about how to call functions when codegen isn’t used
>> or when the function needs to support variable-length argument lists. There
>> are two options:
>>
>> The first option is for each function to have a method that accepts an
>> InternalRow, from the proposed SPIP:
>>
>> interface ScalarFunction<R> extends BoundFunction<R> {
>>   R produceResult(InternalRow input);
>> }
>> interface AggregateFunction<S, R> extends BoundFunction<R> {
>>   S update(S state, InternalRow input);
>>   ...
>> }
>>
>> The second option is to introduce 9 or more interfaces to break out the
>> fields of the input row, and an additional Object[] variation for
>> varargs:
>>
>> interface ScalarFunction1<T1, R> extends BoundFunction<R> {
>>   R produceResult(T1 one);
>> }
>> interface ScalarFunction2<T1, T2, R> extends BoundFunction<R> {
>>   R produceResult(T1 one, T2 two);
>> }
>> ... 8 more ScalarFunction interfaces
>> interface ScalarFunctionVarargs<R> extends BoundFunction<R> {
>>   R produceResult(Object[] args);
>> }
>> interface AggregateFunction1<S, T1, R> extends BoundFunction<R> {
>>   S update(S state, T1 one);
>> }
>> interface AggregateFunction2<S, T1, T2, R> extends BoundFunction<R> {
>>   S update(S state, T1 one, T2 two);
>> }
>> ... 8 more AggregateFunction interfaces
>> interface AggregateFunctionVarargs<S, R> extends BoundFunction<R> {
>>   S update(S state, Object[] args);
>> }
>>
>> Because this is for the non-invoke case, the two options have roughly the
>> same performance characteristics.
>>
>> The first option has some advantages:
>>
>>    - It is simpler: there are few interfaces and Spark will always find
>>    the right method
>>    - Accessing a value returns a concrete type, so it is less
>>    error-prone. I’ve given an example where this helps identify a problem with
>>    an invoke method.
>>
>> The second option’s advantage is that users have values broken out into
>> arguments. That is, if I understand Wenchen correctly here: “I don’t like
>> the SupportsInvoke approach as it still promotes the row-parameter API. I
>> think the individual-parameters API is better for UDF developers.”
>>
>> Disadvantages with the second option:
>>
>>    - There are 20+ more interfaces in the API
>>    - Spark will need additional code to call the right method based on
>>    input, so it will either have 10 wrapper classes or a big match statement
>>    that calls each interface separately (see UDFRegistration
>>    <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala>
>>    ).
>>    - Spark is always going to essentially call the raw interface with no
>>    specific type parameters. As a result, incorrect types (like String)
>>    will compile but fail at runtime with ClassCastException.
>>    - The varargs case will result in casting to expected types, which
>>    could also fail with ClassCastException
>>
>> I think that the InternalRow option is easier to build against because
>> it provides at least some type checking when accessing values from the
>> input row. You get compile-time checks when using the wrong type: for
>> example, String val = input.getString(0); won’t compile because getString
>> returns UTF8String.
>>
>> Another important thing to note is that although the original idea was to
>> keep the individual parameter approach simple, Wenchen has already
>> suggested passing arrays as Java arrays, like UTF8String[]. This adds to
>> the complexity of the overall solution and requires matching multiple
>> types. How would Spark know to pass UTF8String[] or ArrayData?
>>
>> If anyone disagrees with that summary, please point out where it’s
>> incorrect. But barring a major misunderstanding, I think the choice is
>> clear: the simpler approach that provides additional compile-time safety is
>> the right way to go.
>>
>> On Tue, Feb 23, 2021 at 1:48 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> +1, as I already proposed we can move forward with PRs
>>>
>>> > To move forward, how about we implement the function loading and
>>> binding first? Then we can have PRs for both the individual-parameters (I
>>> can take it) and row-parameter approaches, if we still can't reach a
>>> consensus at that time and need to see all the details.
>>>
>>> Ryan, can we focus on the function loading and binding part and get it
>>> committed first? I can also fork your branch and put everything together,
>>> but that might be too big to review.
>>>
>>> On Tue, Feb 23, 2021 at 4:35 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>>> wrote:
>>>
>>>> I have been supporting Ryan's SPIP (the original PR and its extension
>>>> proposal discussed here) because of its simplicity.
>>>>
>>>> According to this email thread context, I also understand the different
>>>> perspectives like Hyukjin's concerns about having multiple ways and
>>>> Wenchen's proposal and rationales.
>>>>
>>>> It looks like we need more discussion to reach an agreement. And the
>>>> technical details become more difficult to track because this is an email
>>>> thread.
>>>>
>>>> Although Ryan initially suggested discussing this on the Apache email
>>>> thread instead of the PR, can we have a PR to discuss?
>>>>
>>>> Especially, Wenchen, could you make your PR based on Ryan's PR?
>>>>
>>>> If we collect the scattered ideas into a single PR, that would be
>>>> greatly helpful not only for further discussion, but also when we vote
>>>> on Ryan's PR or Wenchen's PR.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Mon, Feb 22, 2021 at 1:16 AM Wenchen Fan <cloud0fan@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Walaa,
>>>>>
>>>>> Thanks for sharing this! The type signature stuff is already covered
>>>>> by the unbound UDF API, which specifies the input and output data types.
>>>>> The problem is how to check the method signature of the bound UDF. As you
>>>>> said, Java has type erasure and we can't check `List<String>` for example.
>>>>>
>>>>> My initial proposal is to do nothing and simply pass the Spark
>>>>> ArrayData, MapData, InternalRow to the UDF. This requires the UDF
>>>>> developers to ensure the type matches, as they need to call something
>>>>> like `array.getLong(index)` with the correct type. It's as bad as
>>>>> the row-parameter version but seems fine as it only happens with nested
>>>>> types. And the type check is still done for the first level (the method
>>>>> signature must use ArrayData/MapData/InternalRow at least).
>>>>>
>>>>> We can allow more types in the future to make the type check better.
>>>>> It might be too detailed for this discussion thread, but here are a few
>>>>> thoughts:
>>>>> 1. Java array doesn't do type erasure. We can use UTF8String[] for
>>>>> example if the input type is array of string.
>>>>> 2. For struct type, we can allow Java beans/Scala case classes if the
>>>>> field name and type match the type signature.
>>>>> 3. For map type, it's actually struct<keys: array<key_type>, values:
>>>>> array<value_type>>, so we can also allow Java beans/Scala case
>>>>> classes here.
>>>>>
>>>>> The general idea is to use constructs that retain nested type
>>>>> information at compile time, i.e. arrays, Java beans, and case classes.
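>>>>>
>>>>> For illustration, a rough sketch of both levels (hypothetical UDF that
>>>>> returns the first element of an array<string> input; illustration only;
>>>>> ArrayData and UTF8String are Spark's internal array and string classes):
>>>>>
>>>>> // Baseline: take ArrayData directly; the developer must read elements
>>>>> // with the correct type, e.g. getUTF8String for string elements.
>>>>> class FirstElement implements ScalarFunction1<ArrayData, UTF8String> {
>>>>>   public UTF8String produceResult(ArrayData array) {
>>>>>     return array.getUTF8String(0);
>>>>>   }
>>>>> }
>>>>>
>>>>> // Possible future variant: a Java array keeps the element type at
>>>>> // compile time, so the signature itself documents array<string>.
>>>>> class FirstElement2 implements ScalarFunction1<UTF8String[], UTF8String> {
>>>>>   public UTF8String produceResult(UTF8String[] array) {
>>>>>     return array[0];
>>>>>   }
>>>>> }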
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 22, 2021 at 3:47 PM Walaa Eldin Moustafa <
>>>>> wa.moustafa@gmail.com> wrote:
>>>>>
>>>>>> Wenchen, in Transport, users provide the input parameter signatures
>>>>>> and output parameter signature as part of the API. Compile-time checks are
>>>>>> done by parsing the type signatures and matching them to the type tree
>>>>>> received at compile-time. This also helps with inferring the concrete
>>>>>> output type.
>>>>>>
>>>>>> The specification in the UDF API looks like this:
>>>>>>
>>>>>>   @Override
>>>>>>   public List<String> getInputParameterSignatures() {
>>>>>>     return ImmutableList.of(
>>>>>>         "ARRAY(K)",
>>>>>>         "ARRAY(V)"
>>>>>>     );
>>>>>>   }
>>>>>>
>>>>>>   @Override
>>>>>>   public String getOutputParameterSignature() {
>>>>>>     return "MAP(K,V)";
>>>>>>   }
>>>>>>
>>>>>> The benefits of this kind of type signature specification, as opposed
>>>>>> to inferring types from the Java method signature, are:
>>>>>>
>>>>>>    - For nested types, Java type erasure eliminates the information
>>>>>>    about nested types, so for something like my_function(List<String>
>>>>>>    a1, List<Integer> a2), the runtime class of both a1 and a2 is
>>>>>>    just List. However, we are planning to work around this in a
>>>>>>    future version
>>>>>>    <https://github.com/linkedin/transport/tree/transport-api-v1/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples> in
>>>>>>    the case of Array and Map types. Struct types are discussed in the next
>>>>>>    point.
>>>>>>    - Without pre-code-generation there is no single Java type
>>>>>>    signature from which we can capture the Struct info. However, Struct info
>>>>>>    can be expressed in type signatures of the above type, e.g., ROW(FirstName
>>>>>>    VARCHAR, LastName VARCHAR).
>>>>>>
>>>>>> When a Transport UDF represents a Spark UDF, the type signatures are
>>>>>> matched against Spark native types, i.e., org.apache.spark.sql.types.{ArrayType,
>>>>>> MapType, StructType}, and primitive types. The function that
>>>>>> parses/compiles type signatures is found in AbstractTypeInference
>>>>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-type-system/src/main/java/com/linkedin/transport/typesystem/AbstractTypeInference.java>. This
>>>>>> class represents the generic component that is common between all supported
>>>>>> engines. Its Spark-specific extension is in SparkTypeInference
>>>>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/typesystem/SparkTypeInference.scala>.
>>>>>> In the above example, at compile time, if the first Array happens to be of
>>>>>> String element type, and the second Array happens to be of Integer element
>>>>>> type, the UDF will communicate to the Spark analyzer that the output should
>>>>>> be of type MapData<String, Integer> (i.e., based on what was seen in
>>>>>> the input at compile time). The whole UDF becomes a Spark Expression
>>>>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala#L24>
>>>>>> at the end of the day.
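>>>>>>
>>>>>> To sketch the binding step in Spark type terms (pseudocode for the idea
>>>>>> only, not the actual Transport implementation; types are from
>>>>>> org.apache.spark.sql.types):
>>>>>>
>>>>>> // Input types seen at compile time: ARRAY(K) = array<string>, ARRAY(V) = array<int>
>>>>>> ArrayType keys = DataTypes.createArrayType(DataTypes.StringType);
>>>>>> ArrayType vals = DataTypes.createArrayType(DataTypes.IntegerType);
>>>>>> // Matching the signatures binds K -> string and V -> int, so the output
>>>>>> // signature MAP(K,V) resolves to map<string, int>:
>>>>>> MapType result = DataTypes.createMapType(keys.elementType(), vals.elementType());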
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Sun, Feb 21, 2021 at 7:26 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think I have made it clear that it's simpler for the UDF
>>>>>>> developers to deal with the input parameters directly, instead of getting
>>>>>>> them from a row, as you need to provide the index and type (e.g.
>>>>>>> row.getLong(0)). It's also consistent with the existing Spark
>>>>>>> Scala/Java UDF APIs, so that Spark users will be more familiar with the
>>>>>>> individual-parameters API.
>>>>>>>
>>>>>>> And I have already explained that we can use reflection to make
>>>>>>> sure the defined methods have the right types at query-compilation time.
>>>>>>> It's better than leaving this problem to the UDF developers and asking them
>>>>>>> to ensure the inputs are read from the row with the correct index and type.
>>>>>>> If there are people from Presto/Transport here, it would be great if you
>>>>>>> could share how Presto/Transport do this check.
>>>>>>>
>>>>>>> I don't like 22 additional interfaces either, but if you look at the
>>>>>>> examples I gave, the current Spark Java UDF
>>>>>>> <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java> only
>>>>>>> has 9 interfaces, and Transport
>>>>>>> <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf> has
>>>>>>> 8. I think that's good enough, and people can use VarargsScalarFunction if
>>>>>>> they need to take more parameters or varargs. It resolves your concern
>>>>>>> about doing reflection in the non-codegen execution path, which leads to
>>>>>>> bad performance, and it also serves a documentation purpose, as people can
>>>>>>> easily see the number of UDF inputs and their types at a glance.
>>>>>>>
>>>>>>> As I said, we need to investigate how to avoid boxing. Since you are
>>>>>>> asking the question now, I spent some time thinking about it. I think the
>>>>>>> DoubleAdd example is the way to go. For the non-codegen code path, we
>>>>>>> can just call the interface method. For the codegen code path, the
>>>>>>> generated Java code would look like this (omitting the null check logic):
>>>>>>>
>>>>>>> double input1 = ...;
>>>>>>> double input2 = ...;
>>>>>>> DoubleAdd udf = ...;
>>>>>>> double res = udf.call(input1, input2);
>>>>>>>
>>>>>>> Which invokes the primitive version automatically. AFAIK this is
>>>>>>> also how Scala supports primitive type parameters (generating an extra
>>>>>>> non-boxing version of the method). If the UDF doesn't have the primitive
>>>>>>> version of the method, this code will just call the boxed version and still work.
>>>>>>>
>>>>>>> I don't like the SupportsInvoke approach as it still promotes the
>>>>>>> row-parameter API. I think the individual-parameters API is better for UDF
>>>>>>> developers. Can other people share their opinions about the API?
>>>>>>>
>>>>>>> On Sat, Feb 20, 2021 at 5:32 AM Ryan Blue <rblue@netflix.com> wrote:
>>>>>>>
>>>>>>>> I don’t see any benefit to more complexity with 22 additional
>>>>>>>> interfaces, instead of simply passing an InternalRow. Why not use
>>>>>>>> a single interface with InternalRow? Maybe you could share your
>>>>>>>> motivation?
>>>>>>>>
>>>>>>>> That would also result in strange duplication, where the
>>>>>>>> ScalarFunction2 method is just the boxed version:
>>>>>>>>
>>>>>>>> class DoubleAdd implements ScalarFunction2<Double, Double, Double> {
>>>>>>>>   @Override
>>>>>>>>   Double produceResult(Double left, Double right) {
>>>>>>>>     return left + right;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   double produceResult(double left, double right) {
>>>>>>>>     return left + right;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> This would work okay, but would be awkward if you wanted to use the
>>>>>>>> same implementation for any number of arguments, like a sum method
>>>>>>>> that adds all of the arguments together and returns the result. It also
>>>>>>>> isn’t great for varargs, since it is basically the same as the invoke case.
>>>>>>>>
>>>>>>>> The combination of an InternalRow method and the invoke method
>>>>>>>> seems to be a good way to handle this to me. What is wrong with it? And,
>>>>>>>> how would you solve the problem when implementations define methods with
>>>>>>>> the wrong types? The InternalRow approach helps implementations
>>>>>>>> catch that problem (as demonstrated above) and also provides a fallback
>>>>>>>> when there is a bug preventing the invoke optimization from working. That
>>>>>>>> seems like a good approach to me.
>>>>>>>>
>>>>>>>> On Thu, Feb 18, 2021 at 11:31 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If people have such a big concern about reflection, we can follow
>>>>>>>>> the current Spark Java UDF
>>>>>>>>> <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java>
>>>>>>>>> and Transport
>>>>>>>>> <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf>,
>>>>>>>>> and create ScalarFunction0[R], ScalarFunction1[T1, R], etc. to
>>>>>>>>> avoid reflection. But we may need to investigate how to avoid boxing with
>>>>>>>>> this API design.
>>>>>>>>>
>>>>>>>>> To put up a detailed proposal: let's have ScalarFunction0,
>>>>>>>>> ScalarFunction1, ..., ScalarFunction9 and VarargsScalarFunction. At
>>>>>>>>> execution time, if Spark sees ScalarFunction0-9, it passes the input
>>>>>>>>> columns to the UDF directly, one column per parameter. So a string type input
>>>>>>>>> is UTF8String, array type input is ArrayData. If Spark sees
>>>>>>>>> VarargsScalarFunction, wrap the input columns with InternalRow
>>>>>>>>> and pass it to the UDF.
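>>>>>>>>>
>>>>>>>>> For illustration, a varargs UDF under this proposal could look roughly
>>>>>>>>> like this (hypothetical concat example, sketch only; other methods omitted):
>>>>>>>>>
>>>>>>>>> class ConcatAll implements VarargsScalarFunction<UTF8String> {
>>>>>>>>>   public UTF8String produceResult(InternalRow args) {
>>>>>>>>>     // Spark wraps all input columns in a single InternalRow
>>>>>>>>>     UTF8String result = UTF8String.EMPTY_UTF8;
>>>>>>>>>     for (int i = 0; i < args.numFields(); i++) {
>>>>>>>>>       result = UTF8String.concat(result, args.getUTF8String(i));
>>>>>>>>>     }
>>>>>>>>>     return result;
>>>>>>>>>   }
>>>>>>>>> }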
>>>>>>>>>
>>>>>>>>> In general, if VarargsScalarFunction is implemented, the UDF
>>>>>>>>> should not implement ScalarFunction0-9. We can also define a
>>>>>>>>> priority order to allow this. I don't have a strong preference here.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moustafa@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Ryan on the questions around the expressivity of the
>>>>>>>>>> Invoke method. It is not clear to me how the Invoke method can be used to
>>>>>>>>>> declare UDFs with type-parameterized parameters. For example: a UDF to get
>>>>>>>>>> the Nth element of an array (regardless of the Array element type) or a UDF
>>>>>>>>>> to merge two Arrays (of generic types) to a Map.
>>>>>>>>>>
>>>>>>>>>> Also, to address Wenchen's InternalRow question, can we create a
>>>>>>>>>> number of Function classes, each corresponding to a specific number of
>>>>>>>>>> input parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue
>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with you that it is better in many cases to directly
>>>>>>>>>>> call a method. But it is not better in all cases, which is why I don’t
>>>>>>>>>>> think it is the right general-purpose choice.
>>>>>>>>>>>
>>>>>>>>>>> First, if codegen isn’t used for some reason, the reflection
>>>>>>>>>>> overhead is really significant. That gets much better when you have an
>>>>>>>>>>> interface to call. That’s one reason I’d use this pattern:
>>>>>>>>>>>
>>>>>>>>>>> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>>>>>>>>>>   Double produceResult(InternalRow row) {
>>>>>>>>>>>     return produceResult(row.getDouble(0), row.getDouble(1));
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>>   double produceResult(double left, double right) {
>>>>>>>>>>>     return left + right;
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> There’s little overhead to adding the InternalRow variation,
>>>>>>>>>>> but we could call it in eval to avoid the reflection overhead. To
>>>>>>>>>>> the point about UDF developers, I think this is a reasonable cost.
>>>>>>>>>>>
>>>>>>>>>>> Second, I think usability is better and helps avoid runtime
>>>>>>>>>>> issues. Here’s an example:
>>>>>>>>>>>
>>>>>>>>>>> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>>>>>>>>>>   Integer produceResult(InternalRow row) {
>>>>>>>>>>>     return produceResult(row.getString(0));
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>>   Integer produceResult(String str) {
>>>>>>>>>>>     return str.length();
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> See the bug? I forgot to use UTF8String instead of String. With
>>>>>>>>>>> the InternalRow method, I get a compile-time error because
>>>>>>>>>>> getString produces a UTF8String that can’t be passed to
>>>>>>>>>>> produceResult(String). If I decided to implement length
>>>>>>>>>>> separately, then we could still run the InternalRow version and
>>>>>>>>>>> log a warning. The code would be slightly slower, but wouldn’t fail.
>>>>>>>>>>>
>>>>>>>>>>> There are similar situations with varargs where it’s better to
>>>>>>>>>>> call methods that produce concrete types than to cast from
>>>>>>>>>>> Object to some expected type.
>>>>>>>>>>>
>>>>>>>>>>> I think that using invoke is a great extension to the proposal,
>>>>>>>>>>> but I don’t think that it should be the only way to call functions. By all
>>>>>>>>>>> means, let’s work on it in parallel and use it where possible. But I think
>>>>>>>>>>> we do need the fallback of using InternalRow and that it isn’t
>>>>>>>>>>> a usability problem to include it.
>>>>>>>>>>>
>>>>>>>>>>> Oh, and one last thought is that we already have users that call
>>>>>>>>>>> Dataset.map and use InternalRow. This would allow converting that code
>>>>>>>>>>> directly to a UDF.
>>>>>>>>>>>
>>>>>>>>>>> I think we’re closer to agreeing here than it actually looks.
>>>>>>>>>>> Hopefully you’ll agree that having the InternalRow method isn’t
>>>>>>>>>>> a big usability problem.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan <
>>>>>>>>>>> cloud0fan@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't see any objections to the rest of the proposal (loading
>>>>>>>>>>>> functions from the catalog, function binding stuff, etc.) and I assume
>>>>>>>>>>>> everyone is OK with it. We can commit that part first.
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, the discussion focuses on the `ScalarFunction` API,
>>>>>>>>>>>> where I think it's better to directly take the input columns as the UDF
>>>>>>>>>>>> parameter, instead of wrapping the input columns with
>>>>>>>>>>>> InternalRow and taking the InternalRow as the UDF parameter.
>>>>>>>>>>>> It's not only for better performance, but also for ease of use. For
>>>>>>>>>>>> example, it's easier for the UDF developer to write `input1 +
>>>>>>>>>>>> input2` than `inputRow.getLong(0) + inputRow.getLong(1)`, as
>>>>>>>>>>>> they don't need to specify the type and index by themselves (
>>>>>>>>>>>> getLong(0)) which is error-prone.
>>>>>>>>>>>>
>>>>>>>>>>>> It does push more work to the Spark side, but I think it's
>>>>>>>>>>>> worth it if implementing UDF gets easier. I don't think the work is very
>>>>>>>>>>>> challenging, as we can leverage the infra we built for the expression
>>>>>>>>>>>> encoder.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's also important to look at the UDF API from the
>>>>>>>>>>>> user's perspective (UDF developers). How do you like the UDF API without
>>>>>>>>>>>> considering how Spark can support it? Do you prefer the
>>>>>>>>>>>> individual-parameters version or the row-parameter version?
>>>>>>>>>>>>
>>>>>>>>>>>> To move forward, how about we implement the function loading
>>>>>>>>>>>> and binding first? Then we can have PRs for both the individual-parameters
>>>>>>>>>>>> (I can take it) and row-parameter approaches, if we still can't reach a
>>>>>>>>>>>> consensus at that time and need to see all the details.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue
>>>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Hyukjin. I think that's a fair summary. And I agree
>>>>>>>>>>>>> with the idea that we should focus on what Spark will do by default.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we should focus on the proposal, for two reasons:
>>>>>>>>>>>>> first, there is a straightforward path to incorporate Wenchen's suggestion
>>>>>>>>>>>>> via `SupportsInvoke`, and second, the proposal is more complete: it defines
>>>>>>>>>>>>> a solution for many concerns like loading a function and finding out what
>>>>>>>>>>>>> types to use -- not just how to call code -- and supports more use cases
>>>>>>>>>>>>> like varargs functions. I think we can continue to discuss the rest of the
>>>>>>>>>>>>> proposal and be confident that we can support an invoke code path where it
>>>>>>>>>>>>> makes sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does everyone agree? If not, I think we would need to solve a
>>>>>>>>>>>>> lot of the challenges that I initially brought up with the invoke idea. It
>>>>>>>>>>>>> seems like a good way to call a function, but needs a real proposal behind
>>>>>>>>>>>>> it if we don't use it via `SupportsInvoke` in the current proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon <
>>>>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to make sure we don’t move past this, I think we haven’t
>>>>>>>>>>>>>> decided yet:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - if we’ll replace the current proposal to Wenchen’s
>>>>>>>>>>>>>>    approach as the default
>>>>>>>>>>>>>>    - if we want to have Wenchen’s approach as an optional
>>>>>>>>>>>>>>    mix-in on the top of Ryan’s proposal (SupportsInvoke)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From what I read, some people pointed it out as a
>>>>>>>>>>>>>> replacement. Please correct me if I misread this discussion thread.
>>>>>>>>>>>>>> As Dongjoon pointed out, it would be good to know a rough ETA
>>>>>>>>>>>>>> to make sure we are making progress on this, so people can compare more easily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, there’s the saying I like in the zen of Python
>>>>>>>>>>>>>> <https://www.python.org/dev/peps/pep-0020/>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There should be one— and preferably only one —obvious way to
>>>>>>>>>>>>>> do it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If multiple approaches give developers a way to do (almost) the
>>>>>>>>>>>>>> same thing, I would prefer to avoid that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition, I would prefer to focus on what Spark does by
>>>>>>>>>>>>>> default first.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun <
>>>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, Wenchen.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This thread seems to be getting enough attention. Also, I'm
>>>>>>>>>>>>>>> expecting more and more once we have this on the `master` branch, because
>>>>>>>>>>>>>>> we are developing together.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     > Spark SQL has many active contributors/committers and
>>>>>>>>>>>>>>> this thread doesn't get much attention yet.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, what's your ETA from now?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     > I think the problem here is we were discussing some
>>>>>>>>>>>>>>> very detailed things without actual code.
>>>>>>>>>>>>>>>     > I'll implement my idea after the holiday and then we
>>>>>>>>>>>>>>> can have more effective discussions.
>>>>>>>>>>>>>>>     > We can also do benchmarks and get some real numbers.
>>>>>>>>>>>>>>>     > In the meantime, we can continue to discuss other
>>>>>>>>>>>>>>> parts of this proposal, and make a prototype if possible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm looking forward to seeing your PR. I hope we can
>>>>>>>>>>>>>>> conclude this thread and have at least one implementation in the `master`
>>>>>>>>>>>>>>> branch this month (February).
>>>>>>>>>>>>>>> If you need more time (one month or longer), why don't we
>>>>>>>>>>>>>>> have Ryan's suggestion in the `master` branch first and benchmark it against
>>>>>>>>>>>>>>> your PR later during the Apache Spark 3.2 timeframe?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue
>>>>>>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The proposal already includes an API for aggregate
>>>>>>>>>>>>>>>> functions and I think we would want to implement those right away.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Processing ColumnBatch is something we can easily extend
>>>>>>>>>>>>>>>> the interfaces to support, similar to Wenchen's suggestion. The important
>>>>>>>>>>>>>>>> thing right now is to agree on some basic functionality: how to look up
>>>>>>>>>>>>>>>> functions and what the simple API should be. Like the TableCatalog
>>>>>>>>>>>>>>>> interfaces, we will layer on more support through optional interfaces like
>>>>>>>>>>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <
>>>>>>>>>>>>>>>> andrew.melo@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This proposal looks very interesting. Would future goals
>>>>>>>>>>>>>>>>> for this
>>>>>>>>>>>>>>>>> functionality include both support for aggregation
>>>>>>>>>>>>>>>>> functions, as well
>>>>>>>>>>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>>>>>>>>>>> Row/InternalRow)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Thanks for the positive feedback, everyone. It sounds
>>>>>>>>>>>>>>>>> like there is a clear path forward for calling functions. Even without a
>>>>>>>>>>>>>>>>> prototype, the `invoke` plans show that Wenchen's suggested optimization
>>>>>>>>>>>>>>>>> can be done, and incorporating it as an optional extension to this proposal
>>>>>>>>>>>>>>>>> solves many of the unknowns.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > With that area now understood, is there any discussion
>>>>>>>>>>>>>>>>> about other parts of the proposal, besides the function call interface?
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <
>>>>>>>>>>>>>>>>> sunchao@apache.org> wrote:
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> This is an important feature which can unblock several
>>>>>>>>>>>>>>>>> other projects including bucket join support for DataSource v2, complete
>>>>>>>>>>>>>>>>> support for enforcing DataSource v2 distribution requirements on the write
>>>>>>>>>>>>>>>>> path, etc. I like Ryan's proposals which look simple and elegant, with nice
>>>>>>>>>>>>>>>>> support on function overloading and variadic arguments. On the other hand,
>>>>>>>>>>>>>>>>> I think Wenchen made a very good point about performance. Overall, I'm
>>>>>>>>>>>>>>>>> excited to see active discussions on this topic and believe the community
>>>>>>>>>>>>>>>>> will come to a proposal with the best of both sides.
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Chao
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <
>>>>>>>>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>> >>> +1 for Liang-chi's.
>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh <
>>>>>>>>>>>>>>>>> viirya@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>> Basically I think the proposal makes sense to me and
>>>>>>>>>>>>>>>>> I'd like to support the
>>>>>>>>>>>>>>>>> >>>> SPIP as it looks like we have strong need for the
>>>>>>>>>>>>>>>>> important feature.
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>> Thanks Ryan for working on this and I do also look
>>>>>>>>>>>>>>>>> forward to Wenchen's
>>>>>>>>>>>>>>>>> >>>> implementation. Thanks for the discussion too.
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>> Actually I think the SupportsInvoke proposed by Ryan
>>>>>>>>>>>>>>>>> looks like a good
>>>>>>>>>>>>>>>>> >>>> alternative to me. Besides Wenchen's alternative
>>>>>>>>>>>>>>>>> implementation, is there a
>>>>>>>>>>>>>>>>> >>>> chance we also have the SupportsInvoke for comparison?
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>> John Zhuge wrote
>>>>>>>>>>>>>>>>> >>>> > Excited to see our Spark community rallying behind
>>>>>>>>>>>>>>>>> this important feature!
>>>>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>>>>> >>>> > The proposal lays a solid foundation of minimal
>>>>>>>>>>>>>>>>> feature set with careful
>>>>>>>>>>>>>>>>> >>>> > considerations for future optimizations and
>>>>>>>>>>>>>>>>> extensions. Can't wait to see
>>>>>>>>>>>>>>>>> >>>> > it leading to more advanced functionalities like
>>>>>>>>>>>>>>>>> views with shared custom
>>>>>>>>>>>>>>>>> >>>> > functions, function pushdown, lambda, etc. It has
>>>>>>>>>>>>>>>>> already borne fruit from
>>>>>>>>>>>>>>>>> >>>> > the constructive collaborations in this thread.
>>>>>>>>>>>>>>>>> Looking forward to
>>>>>>>>>>>>>>>>> >>>> > Wenchen's prototype and further discussions
>>>>>>>>>>>>>>>>> including the SupportsInvoke
>>>>>>>>>>>>>>>>> >>>> > extension proposed by Ryan.
>>>>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@> wrote:
>>>>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>>>>> >>>> >> I think this proposal is a very good thing giving
>>>>>>>>>>>>>>>>> Spark a standard way of
>>>>>>>>>>>>>>>>> >>>> >> getting to and calling UDFs.
>>>>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>>>>> >>>> >> I like having the ScalarFunction as the API to
>>>>>>>>>>>>>>>>> call the UDFs. It is
>>>>>>>>>>>>>>>>> >>>> >> simple, yet covers all of the polymorphic type
>>>>>>>>>>>>>>>>> cases well. I think it
>>>>>>>>>>>>>>>>> >>>> >> would
>>>>>>>>>>>>>>>>> >>>> >> also simplify using the functions in other
>>>>>>>>>>>>>>>>> contexts like pushing down
>>>>>>>>>>>>>>>>> >>>> >> filters into the ORC & Parquet readers although
>>>>>>>>>>>>>>>>> there are a lot of
>>>>>>>>>>>>>>>>> >>>> >> details
>>>>>>>>>>>>>>>>> >>>> >> that would need to be considered there.
>>>>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>>>>> >>>> >> .. Owen
>>>>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@.com> wrote:
>>>>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>>>>> >>>> >>> I agree that there is a strong need for a
>>>>>>>>>>>>>>>>> FunctionCatalog within Spark
>>>>>>>>>>>>>>>>> >>>> >>> to
>>>>>>>>>>>>>>>>> >>>> >>> provide support for shareable UDFs, as well as
>>>>>>>>>>>>>>>>> make movement towards
>>>>>>>>>>>>>>>>> >>>> >>> more
>>>>>>>>>>>>>>>>> >>>> >>> advanced functionality like views which
>>>>>>>>>>>>>>>>> themselves depend on UDFs, so I
>>>>>>>>>>>>>>>>> >>>> >>> support this SPIP wholeheartedly.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I find both of the proposed UDF APIs to be
>>>>>>>>>>>>>>>>> sufficiently user-friendly
>>>>>>>>>>>>>>>>> >>>> >>> and
>>>>>>>>>>>>>>>>> >>>> >>> extensible. I generally think Wenchen's proposal
>>>>>>>>>>>>>>>>> is easier for a user to
>>>>>>>>>>>>>>>>> >>>> >>> work with in the common case, but has greater
>>>>>>>>>>>>>>>>> potential for confusing
>>>>>>>>>>>>>>>>> >>>> >>> and
>>>>>>>>>>>>>>>>> >>>> >>> hard-to-debug behavior due to use of reflective
>>>>>>>>>>>>>>>>> method signature
>>>>>>>>>>>>>>>>> >>>> >>> searches.
>>>>>>>>>>>>>>>>> >>>> >>> The merits on both sides can hopefully be more
>>>>>>>>>>>>>>>>> properly examined with
>>>>>>>>>>>>>>>>> >>>> >>> code,
>>>>>>>>>>>>>>>>> >>>> >>> so I look forward to seeing an implementation of
>>>>>>>>>>>>>>>>> Wenchen's ideas to
>>>>>>>>>>>>>>>>> >>>> >>> provide
>>>>>>>>>>>>>>>>> >>>> >>> a more concrete comparison. I am optimistic that
>>>>>>>>>>>>>>>>> we will not let the
>>>>>>>>>>>>>>>>> >>>> >>> debate
>>>>>>>>>>>>>>>>> >>>> >>> over this point unreasonably stall the SPIP from
>>>>>>>>>>>>>>>>> making progress.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Thank you to both Wenchen and Ryan for your
>>>>>>>>>>>>>>>>> detailed consideration and
>>>>>>>>>>>>>>>>> >>>> >>> evaluation of these ideas!
>>>>>>>>>>>>>>>>> >>>> >>> ------------------------------
>>>>>>>>>>>>>>>>> >>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@>
>>>>>>>>>>>>>>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>>>>>>>>>>>>>>> >>>> >>> *To:* Ryan Blue <blue@>
>>>>>>>>>>>>>>>>> >>>> >>> *Cc:* Holden Karau <holden@>; Hyukjin Kwon <gurwls223@>; Spark Dev List <dev@.apache>; Wenchen Fan <cloud0fan@>
>>>>>>>>>>>>>>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> BTW, I forgot to add my opinion explicitly in
>>>>>>>>>>>>>>>>> this thread because I was
>>>>>>>>>>>>>>>>> >>>> >>> on the PR before this thread.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May
>>>>>>>>>>>>>>>>> 9, 2019 and has been
>>>>>>>>>>>>>>>>> >>>> >>> there for almost two years.
>>>>>>>>>>>>>>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday
>>>>>>>>>>>>>>>>> because I agreed with
>>>>>>>>>>>>>>>>> >>>> >>> the latest updated design docs and AS-IS PR.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> And, the rest of the progress in this thread is
>>>>>>>>>>>>>>>>> also very satisfying to
>>>>>>>>>>>>>>>>> >>>> >>> me.
>>>>>>>>>>>>>>>>> >>>> >>> (e.g. Ryan's extension suggestion and Wenchen's
>>>>>>>>>>>>>>>>> alternative)
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> To All:
>>>>>>>>>>>>>>>>> >>>> >>> Please take a look at the design doc and the PR,
>>>>>>>>>>>>>>>>> and give us some
>>>>>>>>>>>>>>>>> >>>> >>> opinions.
>>>>>>>>>>>>>>>>> >>>> >>> We really need your participation in order to
>>>>>>>>>>>>>>>>> make DSv2 more complete.
>>>>>>>>>>>>>>>>> >>>> >>> This will unblock other DSv2 features, too.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.hyun@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Hi, Ryan.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> We didn't move past anything (both yours and
>>>>>>>>>>>>>>>>> Wenchen's). What Wenchen
>>>>>>>>>>>>>>>>> >>>> >>> suggested is double-checking the alternatives
>>>>>>>>>>>>>>>>> with the implementation to
>>>>>>>>>>>>>>>>> >>>> >>> give more momentum to our discussion.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Your new suggestion about an optional extension also
>>>>>>>>>>>>>>>>> >>>> >>> sounds like a reasonable new alternative to me.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> We are still discussing this topic together and I
>>>>>>>>>>>>>>>>> >>>> >>> hope we can reach a conclusion this time (for Apache Spark 3.2)
>>>>>>>>>>>>>>>>> >>>> >>> without getting stuck like last time.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I really appreciate your leadership in this
>>>>>>>>>>>>>>>>> >>>> >>> discussion, and the direction it is moving in looks constructive
>>>>>>>>>>>>>>>>> >>>> >>> to me. Let's give some
>>>>>>>>>>>>>>>>> >>>> >>> time
>>>>>>>>>>>>>>>>> >>>> >>> to the alternatives.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <blue@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I don’t think we should so quickly move past the
>>>>>>>>>>>>>>>>> drawbacks of this
>>>>>>>>>>>>>>>>> >>>> >>> approach. The problems are significant enough
>>>>>>>>>>>>>>>>> that using invoke is not
>>>>>>>>>>>>>>>>> >>>> >>> sufficient on its own. But, I think we can add it
>>>>>>>>>>>>>>>>> as an optional
>>>>>>>>>>>>>>>>> >>>> >>> extension
>>>>>>>>>>>>>>>>> >>>> >>> to shore up the weaknesses.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Here’s a summary of the drawbacks:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>    - Magic function signatures are error-prone
>>>>>>>>>>>>>>>>> >>>> >>>    - Spark would need considerable code to help
>>>>>>>>>>>>>>>>> users find what went
>>>>>>>>>>>>>>>>> >>>> >>>    wrong
>>>>>>>>>>>>>>>>> >>>> >>>    - Spark would likely need to coerce arguments
>>>>>>>>>>>>>>>>> (e.g., String,
>>>>>>>>>>>>>>>>> >>>> >>>    Option[Int]) for usability
>>>>>>>>>>>>>>>>> >>>> >>>    - It is unclear how Spark will find the Java
>>>>>>>>>>>>>>>>> Method to call
>>>>>>>>>>>>>>>>> >>>> >>>    - Use cases that require varargs fall back to
>>>>>>>>>>>>>>>>> casting; users will
>>>>>>>>>>>>>>>>> >>>> >>>    also get this wrong (cast to String instead of
>>>>>>>>>>>>>>>>> UTF8String)
>>>>>>>>>>>>>>>>> >>>> >>>    - The non-codegen path is significantly slower
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> The benefit of invoke is to avoid moving data
>>>>>>>>>>>>>>>>> into a row, like this:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> -- using invoke
>>>>>>>>>>>>>>>>> >>>> >>> int result = udfFunction(x, y)
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> -- using row
>>>>>>>>>>>>>>>>> >>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>>>>>>>>>>>>>>> >>>> >>> udfRow.update(1, y);
>>>>>>>>>>>>>>>>> >>>> >>> int result = udfFunction(udfRow);
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> And, again, that won’t actually help much in
>>>>>>>>>>>>>>>>> cases that require varargs.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I suggest we add a new marker trait for
>>>>>>>>>>>>>>>>> BoundMethod called
>>>>>>>>>>>>>>>>> >>>> >>> SupportsInvoke.
>>>>>>>>>>>>>>>>> >>>> >>> If that interface is implemented, then Spark will
>>>>>>>>>>>>>>>>> look for a method that
>>>>>>>>>>>>>>>>> >>>> >>> matches the expected signature based on the bound
>>>>>>>>>>>>>>>>> input type. If it
>>>>>>>>>>>>>>>>> >>>> >>> isn’t
>>>>>>>>>>>>>>>>> >>>> >>> found, Spark can print a warning and fall back to
>>>>>>>>>>>>>>>>> the InternalRow call:
>>>>>>>>>>>>>>>>> >>>> >>> “Cannot find udfFunction(int, int)”.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> This approach allows the invoke optimization, but
>>>>>>>>>>>>>>>>> solves many of the
>>>>>>>>>>>>>>>>> >>>> >>> problems:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>    - The method to invoke is found using the
>>>>>>>>>>>>>>>>> proposed load and bind
>>>>>>>>>>>>>>>>> >>>> >>>    approach
>>>>>>>>>>>>>>>>> >>>> >>>    - Magic function signatures are optional and
>>>>>>>>>>>>>>>>> do not cause runtime
>>>>>>>>>>>>>>>>> >>>> >>>    failures
>>>>>>>>>>>>>>>>> >>>> >>>    - Because this is an optional optimization,
>>>>>>>>>>>>>>>>> Spark can be more strict
>>>>>>>>>>>>>>>>> >>>> >>>    about types
>>>>>>>>>>>>>>>>> >>>> >>>    - Varargs cases can still use rows
>>>>>>>>>>>>>>>>> >>>> >>>    - Non-codegen can use an evaluation method
>>>>>>>>>>>>>>>>> rather than falling back
>>>>>>>>>>>>>>>>> >>>> >>>    to slow Java reflection
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> This seems like a good extension to me; this
>>>>>>>>>>>>>>>>> provides a plan for
>>>>>>>>>>>>>>>>> >>>> >>> optimizing the UDF call to avoid building a row,
>>>>>>>>>>>>>>>>> while the existing
>>>>>>>>>>>>>>>>> >>>> >>> proposal covers the other cases well and
>>>>>>>>>>>>>>>>> addresses how to locate these
>>>>>>>>>>>>>>>>> >>>> >>> function calls.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> This also highlights that the approach used in
>>>>>>>>>>>>>>>>> DSv2 and this proposal is
>>>>>>>>>>>>>>>>> >>>> >>> working: start small and use extensions to layer
>>>>>>>>>>>>>>>>> on more complex
>>>>>>>>>>>>>>>>> >>>> >>> support.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun <dongjoon.hyun@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Thank you all for making a giant move forward for
>>>>>>>>>>>>>>>>> Apache Spark 3.2.0.
>>>>>>>>>>>>>>>>> >>>> >>> I'm really looking forward to seeing Wenchen's
>>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>> >>>> >>> That would be greatly helpful to make a decision!
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> > I'll implement my idea after the holiday and
>>>>>>>>>>>>>>>>> then we can have
>>>>>>>>>>>>>>>>> >>>> >>> more effective discussions. We can also do
>>>>>>>>>>>>>>>>> benchmarks and get some real
>>>>>>>>>>>>>>>>> >>>> >>> numbers.
>>>>>>>>>>>>>>>>> >>>> >>> > FYI: the Presto UDF API
>>>>>>>>>>>>>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html>
>>>>>>>>>>>>>>>>> >>>> >>> also
>>>>>>>>>>>>>>>>> >>>> >>> takes individual parameters instead of the row
>>>>>>>>>>>>>>>>> parameter. I think this
>>>>>>>>>>>>>>>>> >>>> >>> direction at least worth a try so that we can see
>>>>>>>>>>>>>>>>> the performance
>>>>>>>>>>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc
>>>>>>>>>>>>>>>>> as an alternative
>>>>>>>>>>>>>>>>> >>>> >>> (Trino).
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> FYI: the Presto UDF API
>>>>>>>>>>>>>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html>
>>>>>>>>>>>>>>>>> >>>> >>> also takes individual parameters instead of the
>>>>>>>>>>>>>>>>> row parameter. I think
>>>>>>>>>>>>>>>>> >>>> >>> this
>>>>>>>>>>>>>>>>> >>>> >>> direction at least worth a try so that we can see
>>>>>>>>>>>>>>>>> the performance
>>>>>>>>>>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc
>>>>>>>>>>>>>>>>> as an alternative
>>>>>>>>>>>>>>>>> >>>> >>> (Trino).
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Hi Holden,
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> As Hyukjin said, following existing designs is
>>>>>>>>>>>>>>>>> not the principle of DS
>>>>>>>>>>>>>>>>> >>>> >>> v2
>>>>>>>>>>>>>>>>> >>>> >>> API design. We should make sure the DS v2 API
>>>>>>>>>>>>>>>>> makes sense. AFAIK we
>>>>>>>>>>>>>>>>> >>>> >>> didn't
>>>>>>>>>>>>>>>>> >>>> >>> fully follow the catalog API design from Hive and
>>>>>>>>>>>>>>>>> I believe Ryan also
>>>>>>>>>>>>>>>>> >>>> >>> agrees with it.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I think the problem here is we were discussing
>>>>>>>>>>>>>>>>> some very detailed things
>>>>>>>>>>>>>>>>> >>>> >>> without actual code. I'll implement my idea after
>>>>>>>>>>>>>>>>> the holiday and then
>>>>>>>>>>>>>>>>> >>>> >>> we
>>>>>>>>>>>>>>>>> >>>> >>> can have more effective discussions. We can also
>>>>>>>>>>>>>>>>> do benchmarks and get
>>>>>>>>>>>>>>>>> >>>> >>> some
>>>>>>>>>>>>>>>>> >>>> >>> real numbers.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> In the meantime, we can continue to discuss other
>>>>>>>>>>>>>>>>> parts of this
>>>>>>>>>>>>>>>>> >>>> >>> proposal,
>>>>>>>>>>>>>>>>> >>>> >>> and make a prototype if possible. Spark SQL has
>>>>>>>>>>>>>>>>> many active
>>>>>>>>>>>>>>>>> >>>> >>> contributors/committers and this thread doesn't
>>>>>>>>>>>>>>>>> get much attention yet.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Just dropping a few lines. I remember that one of
>>>>>>>>>>>>>>>>> the goals in DSv2 is
>>>>>>>>>>>>>>>>> >>>> >>> to
>>>>>>>>>>>>>>>>> >>>> >>> correct the mistakes we made in the current Spark
>>>>>>>>>>>>>>>>> codes.
>>>>>>>>>>>>>>>>> >>>> >>> It would not have much point if we will happen to
>>>>>>>>>>>>>>>>> just follow and mimic
>>>>>>>>>>>>>>>>> >>>> >>> what Spark currently does. It might just end up
>>>>>>>>>>>>>>>>> with another copy of
>>>>>>>>>>>>>>>>> >>>> >>> Spark
>>>>>>>>>>>>>>>>> >>>> >>> APIs, e.g. Expression (internal) APIs. I
>>>>>>>>>>>>>>>>> sincerely would like to avoid
>>>>>>>>>>>>>>>>> >>>> >>> this
>>>>>>>>>>>>>>>>> >>>> >>> I do believe we have been stuck mainly due to
>>>>>>>>>>>>>>>>> trying to come up with a
>>>>>>>>>>>>>>>>> >>>> >>> better design. We already have an ugly picture of
>>>>>>>>>>>>>>>>> the current Spark APIs
>>>>>>>>>>>>>>>>> >>>> >>> to
>>>>>>>>>>>>>>>>> >>>> >>> draw a better bigger picture.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau <holden@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I think this proposal is a good set of trade-offs
>>>>>>>>>>>>>>>>> and has existed in the
>>>>>>>>>>>>>>>>> >>>> >>> community for a long period of time. I especially
>>>>>>>>>>>>>>>>> appreciate how the
>>>>>>>>>>>>>>>>> >>>> >>> design
>>>>>>>>>>>>>>>>> >>>> >>> is focused on a minimal useful component, with
>>>>>>>>>>>>>>>>> future optimizations
>>>>>>>>>>>>>>>>> >>>> >>> considered from a point of view of making sure
>>>>>>>>>>>>>>>>> it's flexible, but actual
>>>>>>>>>>>>>>>>> >>>> >>> concrete decisions left for the future once we
>>>>>>>>>>>>>>>>> see how this API is used.
>>>>>>>>>>>>>>>>> >>>> >>> I
>>>>>>>>>>>>>>>>> >>>> >>> think if we try and optimize everything right out
>>>>>>>>>>>>>>>>> of the gate, we'll
>>>>>>>>>>>>>>>>> >>>> >>> quickly get stuck (again) and not make any
>>>>>>>>>>>>>>>>> progress.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@> wrote:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Hi everyone,
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> I'd like to start a discussion for adding a
>>>>>>>>>>>>>>>>> FunctionCatalog interface to
>>>>>>>>>>>>>>>>> >>>> >>> catalog plugins. This will allow catalogs to
>>>>>>>>>>>>>>>>> expose functions to Spark,
>>>>>>>>>>>>>>>>> >>>> >>> similar to how the TableCatalog interface allows
>>>>>>>>>>>>>>>>> a catalog to expose
>>>>>>>>>>>>>>>>> >>>> >>> tables. The proposal doc is available here:
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Here's a high-level summary of some of the main
>>>>>>>>>>>>>>>>> design choices:
>>>>>>>>>>>>>>>>> >>>> >>> * Adds the ability to list and load functions,
>>>>>>>>>>>>>>>>> not to create or modify
>>>>>>>>>>>>>>>>> >>>> >>> them in an external catalog
>>>>>>>>>>>>>>>>> >>>> >>> * Supports scalar, aggregate, and partial
>>>>>>>>>>>>>>>>> aggregate functions
>>>>>>>>>>>>>>>>> >>>> >>> * Uses load and bind steps for better error
>>>>>>>>>>>>>>>>> messages and simpler
>>>>>>>>>>>>>>>>> >>>> >>> implementations
>>>>>>>>>>>>>>>>> >>>> >>> * Like the DSv2 table read and write APIs, it
>>>>>>>>>>>>>>>>> uses InternalRow to pass
>>>>>>>>>>>>>>>>> >>>> >>> data
>>>>>>>>>>>>>>>>> >>>> >>> * Can be extended using mix-in interfaces to add
>>>>>>>>>>>>>>>>> vectorization, codegen,
>>>>>>>>>>>>>>>>> >>>> >>> and other future features
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> There is also a PR with the proposed API:
>>>>>>>>>>>>>>>>> >>>> >>> https://github.com/apache/spark/pull/24559/files
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> Let's discuss the proposal here rather than on
>>>>>>>>>>>>>>>>> that PR, to get better
>>>>>>>>>>>>>>>>> >>>> >>> visibility. Also, please take the time to read
>>>>>>>>>>>>>>>>> the proposal first. That
>>>>>>>>>>>>>>>>> >>>> >>> really helps clear up misconceptions.
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>>>>> >>>> >>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>>>>> >>>> >>> Books (Learning Spark, High Performance Spark,
>>>>>>>>>>>>>>>>> etc.):
>>>>>>>>>>>>>>>>> >>>> >>> https://amzn.to/2MaRAG9
>>>>>>>>>>>>>>>>> >>>> >>> YouTube Live Streams:
>>>>>>>>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>>>>> >>>> > --
>>>>>>>>>>>>>>>>> >>>> > John Zhuge
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>> --
>>>>>>>>>>>>>>>>> >>>> Sent from:
>>>>>>>>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>> >>>> To unsubscribe e-mail:
>>>>>>>>>>>>>>>>> dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > --
>>>>>>>>>>>>>>>>> > Ryan Blue
>>>>>>>>>>>>>>>>> > Software Engineer
>>>>>>>>>>>>>>>>> > Netflix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
