spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Tue, 23 Feb 2021 09:48:02 GMT
+1. As I already proposed, we can move forward with PRs:

> To move forward, how about we implement the function loading and binding
first? Then we can have PRs for both the individual-parameters (I can take
it) and row-parameter approaches, if we still can't reach a consensus at
that time and need to see all the details.

Ryan, can we focus on the function loading and binding part and get it
committed first? I can also fork your branch and put everything together,
but that might be too big to review.

On Tue, Feb 23, 2021 at 4:35 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> I've been still supporting Ryan's SPIP (original PR and its extension
> proposal discussed here) because of its simplicity.
>
> From the context of this email thread, I also understand the different
> perspectives, like Hyukjin's concerns about having multiple ways and
> Wenchen's proposal and rationale.
>
> It looks like we need more discussion to reach an agreement. And the
> technical details become more difficult to track because this is an email
> thread.
>
> Although Ryan initially suggested discussing this on the Apache email thread
> instead of the PR, can we have a PR to discuss?
>
> Especially, Wenchen, could you make your PR based on Ryan's PR?
>
> If we collect the scattered ideas into a single PR, that would be greatly
> helpful not only for further discussion, but also when we vote on
> Ryan's PR or Wenchen's PR.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Feb 22, 2021 at 1:16 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Hi Walaa,
>>
>> Thanks for sharing this! The type signature stuff is already covered by
>> the unbound UDF API, which specifies the input and output data types. The
>> problem is how to check the method signature of the bound UDF. As you said,
>> Java has type erasure and we can't check `List<String>` for example.
>>
>> My initial proposal is to do nothing and simply pass the Spark ArrayData,
>> MapData, InternalRow to the UDF. This requires the UDF developers to ensure
>> the types match, as they need to call something like
>> `array.getLong(index)` with the correct type. It's as bad as the
>> row-parameter version, but that seems fine since it only happens with nested types.
>> And the type check is still done at the first level (the method signature
>> must at least use ArrayData/MapData/InternalRow).
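>> A minimal sketch of what that could look like (a hypothetical bound method
>> under this approach; the method name and the array<long> input are just for
>> illustration):
>>
>>   // import org.apache.spark.sql.catalyst.util.ArrayData;
>>   // The developer must call the getter that matches the element type.
>>   public long sumLongArray(ArrayData input) {
>>     long sum = 0;
>>     for (int i = 0; i < input.numElements(); i++) {
>>       sum += input.getLong(i);  // only correct if the elements really are longs
>>     }
>>     return sum;
>>   }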
>>
>> We can allow more types in the future to make the type check better. It
>> might be too detailed for this discussion thread, but here are a few
>> thoughts:
>> 1. Java arrays don't have type erasure. We can use UTF8String[], for
>> example, if the input type is array of string.
>> 2. For struct type, we can allow Java beans/Scala case classes if the
>> field name and type match the type signature.
>> 3. For map type, it's actually struct<keys: array<key_type>, values:
>> array<value_type>>, so we can also allow Java beans/Scala case classes
>> here.
>>
>> The general idea is to use constructs that retain nested type information
>> at compile time, i.e. arrays, Java beans, and case classes.
>>
>> Thanks,
>> Wenchen
>>
>>
>>
>> On Mon, Feb 22, 2021 at 3:47 PM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> Wenchen, in Transport, users provide the input parameter signatures and
>>> output parameter signature as part of the API. Compile-time checks are done
>>> by parsing the type signatures and matching them to the type tree received
>>> at compile-time. This also helps with inferring the concrete output type.
>>>
>>> The specification in the UDF API looks like this:
>>>
>>>   @Override
>>>   public List<String> getInputParameterSignatures() {
>>>     return ImmutableList.of(
>>>         "ARRAY(K)",
>>>         "ARRAY(V)"
>>>     );
>>>   }
>>>
>>>   @Override
>>>   public String getOutputParameterSignature() {
>>>     return "MAP(K,V)";
>>>   }
>>>
>>> The benefits of this kind of type signature specification, as opposed to
>>> inferring types from the Java type signatures of the Java method, are:
>>>
>>>    - For nested types, Java type erasure eliminates the information
>>>    about nested types, so for something like my_function(List<String>
>>>    a1, List<Integer> a2), the runtime class of both a1 and a2 is
>>>    just List. However, we are planning to work around this in a
>>>    future version
>>>    <https://github.com/linkedin/transport/tree/transport-api-v1/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples> in
>>>    the case of Array and Map types. Struct types are discussed in the next
>>>    point.
>>>    - Without pre-code-generation there is no single Java type signature
>>>    from which we can capture the Struct info. However, Struct info can be
>>>    expressed in type signatures of the above type, e.g., ROW(FirstName
>>>    VARCHAR, LastName VARCHAR).
>>>
>>> When a Transport UDF represents a Spark UDF, the type signatures are
>>> matched against Spark native types, i.e., org.apache.spark.sql.types.{ArrayType,
>>> MapType, StructType}, and primitive types. The function that
>>> parses/compiles type signatures is found in AbstractTypeInference
>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-type-system/src/main/java/com/linkedin/transport/typesystem/AbstractTypeInference.java>. This
>>> class represents the generic component that is common between all supported
>>> engines. Its Spark-specific extension is in SparkTypeInference
>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/typesystem/SparkTypeInference.scala>.
>>> In the above example, at compile time, if the first Array happens to be of
>>> String element type, and the second Array happens to be of Integer element
>>> type, the UDF will communicate to the Spark analyzer that the output should
>>> be of type MapData<String, Integer> (i.e., based on what was seen in
>>> the input at compile time). The whole UDF becomes a Spark Expression
>>> <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala#L24>
>>> at the end of the day.
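>>> For example (illustrative only, using Spark's public type API rather than
>>> Transport's internal classes): binding the earlier signatures against inputs of
>>> type array<string> and array<int> yields K = StringType and V = IntegerType, so
>>> the inferred output type is the Spark type below.
>>>
>>>   // import org.apache.spark.sql.types.DataType;
>>>   // import org.apache.spark.sql.types.DataTypes;
>>>   DataType outputType =
>>>       DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType);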
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Sun, Feb 21, 2021 at 7:26 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>
>>>> I think I have made it clear that it's simpler for the UDF developers
>>>> to deal with the input parameters directly, instead of getting them from a
>>>> row, where you need to provide the index and type (e.g. row.getLong(0)).
>>>> It's also consistent with the existing Spark Scala/Java UDF APIs, so
>>>> Spark users will be more familiar with the individual-parameters API.
>>>>
>>>> And I have already explained that we can use reflection to make sure
>>>> the defined methods have the right types at query-compilation time. It's
>>>> better than leaving this problem to the UDF developers and asking them to
>>>> ensure the inputs are read from the row with the correct index and type. If
>>>> there are people from Presto/Transport here, it would be great if you could share
>>>> how Presto/Transport does this check.
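>>>> (A rough sketch of what such a reflection check could look like; the method
>>>> name, parameter classes, and exception type are illustrative only:)
>>>>
>>>>   // At query-compilation time, verify the UDF class defines a produceResult
>>>>   // method whose parameters match the bound input types.
>>>>   Class<?>[] expectedParams = new Class<?>[] { long.class, long.class };
>>>>   try {
>>>>     udf.getClass().getMethod("produceResult", expectedParams);
>>>>   } catch (NoSuchMethodException e) {
>>>>     throw new IllegalArgumentException(
>>>>         "UDF does not define produceResult(long, long)", e);
>>>>   }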
>>>>
>>>> I don't like 22 additional interfaces either, but if you look at the
>>>> examples I gave, the current Spark Java UDF
>>>> <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java> only
>>>> has 9 interfaces, and Transport
>>>> <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf> has
>>>> 8. I think that's good enough, and people can use VarargsScalarFunction if
>>>> they need to take more parameters or varargs. It resolves your concern
>>>> about doing reflection in the non-codegen execution path, which leads to bad
>>>> performance, and it also serves a documentation purpose, as people can easily
>>>> see the number of UDF inputs and their types at a glance.
>>>>
>>>> As I said, we need to investigate how to avoid boxing. Since you are
>>>> asking the question now, I spent some time thinking about it. I think the
>>>> DoubleAdd example is the way to go. For the non-codegen code path, we can
>>>> just call the interface method. For the codegen code path, the generated
>>>> Java code would look like this (omitting the null-check logic):
>>>>
>>>> double input1 = ...;
>>>> double input2 = ...;
>>>> DoubleAdd udf = ...;
>>>> double res = udf.call(input1, input2);
>>>>
>>>> This invokes the primitive version automatically. AFAIK this is also
>>>> how Scala supports primitive type parameters (generating an extra non-boxing
>>>> version of the method). If the UDF doesn't have the primitive version of the
>>>> method, this code will just call the boxed version and still work.
>>>>
>>>> I don't like the SupportsInvoke approach as it still promotes the
>>>> row-parameter API. I think the individual-parameters API is better for UDF
>>>> developers. Can other people share their opinions about the API?
>>>>
>>>> On Sat, Feb 20, 2021 at 5:32 AM Ryan Blue <rblue@netflix.com> wrote:
>>>>
>>>>> I don’t see any benefit to more complexity with 22 additional
>>>>> interfaces, instead of simply passing an InternalRow. Why not use a
>>>>> single interface with InternalRow? Maybe you could share your
>>>>> motivation?
>>>>>
>>>>> That would also result in strange duplication, where the
>>>>> ScalarFunction2 method is just the boxed version:
>>>>>
>>>>> class DoubleAdd implements ScalarFunction2<Double, Double, Double> {
>>>>>   @Override
>>>>>   Double produceResult(Double left, Double right) {
>>>>>     return left + right;
>>>>>   }
>>>>>
>>>>>   double produceResult(double left, double right) {
>>>>>     return left + right;
>>>>>   }
>>>>> }
>>>>>
>>>>> This would work okay, but would be awkward if you wanted to use the
>>>>> same implementation for any number of arguments, like a sum method
>>>>> that adds all of the arguments together and returns the result. It also
>>>>> isn’t great for varargs, since it is basically the same as the invoke case.
>>>>>
>>>>> The combination of an InternalRow method and the invoke method seems
>>>>> to be a good way to handle this to me. What is wrong with it? And, how
>>>>> would you solve the problem when implementations define methods with the
>>>>> wrong types? The InternalRow approach helps implementations catch
>>>>> that problem (as demonstrated above) and also provides a fallback when
>>>>> there is a bug preventing the invoke optimization from working. That seems
>>>>> like a good approach to me.
>>>>>
>>>>> On Thu, Feb 18, 2021 at 11:31 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If people have such a big concern about reflection, we can follow the current
>>>>>> Spark Java UDF
>>>>>> <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java>
>>>>>> and Transport
>>>>>> <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf>,
>>>>>> and create ScalarFunction0[R], ScalarFunction1[T1, R], etc. to avoid
>>>>>> reflection. But we may need to investigate how to avoid boxing with this
>>>>>> API design.
>>>>>>
>>>>>> To put forward a detailed proposal: let's have ScalarFunction0, ScalarFunction1,
>>>>>> ..., ScalarFunction9 and VarargsScalarFunction. At execution time, if
>>>>>> Spark sees ScalarFunction0-9, it passes the input columns to the UDF
>>>>>> directly, one column per parameter. So a string-type input is
>>>>>> UTF8String and an array-type input is ArrayData. If Spark sees
>>>>>> VarargsScalarFunction, it wraps the input columns in an InternalRow and
>>>>>> passes that to the UDF.
>>>>>>
>>>>>> In general, if VarargsScalarFunction is implemented, the UDF should
>>>>>> not implement ScalarFunction0-9. We can also define a priority order
>>>>>> to allow this. I don't have a strong preference here.
>>>>>>
>>>>>> What do you think?
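>>>>>> A rough sketch of the shape I have in mind (the names and type parameters
>>>>>> are illustrative, not final):
>>>>>>
>>>>>> // Fixed-arity variant: one type parameter per input column, plus the result.
>>>>>> interface ScalarFunction2<T1, T2, R> {
>>>>>>   R produceResult(T1 input1, T2 input2);
>>>>>> }
>>>>>>
>>>>>> // Varargs variant: Spark wraps the input columns in an InternalRow.
>>>>>> // import org.apache.spark.sql.catalyst.InternalRow;
>>>>>> interface VarargsScalarFunction<R> {
>>>>>>   R produceResult(InternalRow input);
>>>>>> }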
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa <
>>>>>> wa.moustafa@gmail.com> wrote:
>>>>>>
>>>>>>> I agree with Ryan on the questions around the expressivity of the
>>>>>>> Invoke method. It is not clear to me how the Invoke method can be used to
>>>>>>> declare UDFs with type-parameterized parameters. For example: a UDF to get
>>>>>>> the Nth element of an array (regardless of the Array element type) or a UDF
>>>>>>> to merge two Arrays (of generic types) into a Map.
>>>>>>>
>>>>>>> Also, to address Wenchen's InternalRow question, can we create a
>>>>>>> number of Function classes, each corresponding to a specific number of
>>>>>>> input parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue <rblue@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree with you that it is better in many cases to directly call a
>>>>>>>> method. But it is not better in all cases, which is why I don’t think it is
>>>>>>>> the right general-purpose choice.
>>>>>>>>
>>>>>>>> First, if codegen isn’t used for some reason, the reflection
>>>>>>>> overhead is really significant. That gets much better when you have an
>>>>>>>> interface to call. That’s one reason I’d use this pattern:
>>>>>>>>
>>>>>>>> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>>>>>>>   Double produceResult(InternalRow row) {
>>>>>>>>     return produceResult(row.getDouble(0), row.getDouble(1));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   double produceResult(double left, double right) {
>>>>>>>>     return left + right;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> There’s little overhead to adding the InternalRow variation, but
>>>>>>>> we could call it in eval to avoid the reflection overhead. To the
>>>>>>>> point about UDF developers, I think this is a reasonable cost.
>>>>>>>>
>>>>>>>> Second, I think usability is better and helps avoid runtime issues.
>>>>>>>> Here’s an example:
>>>>>>>>
>>>>>>>> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>>>>>>>   Integer produceResult(InternalRow row) {
>>>>>>>>     return produceResult(row.getString(0));
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   Integer produceResult(String str) {
>>>>>>>>     return str.length();
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> See the bug? I forgot to use UTF8String instead of String. With
>>>>>>>> the InternalRow method, I get a compiler warning because getString
>>>>>>>> produces UTF8String that can’t be passed to produceResult(String).
>>>>>>>> If I decided to implement length separately, then we could still
>>>>>>>> run the InternalRow version and log a warning. The code would be
>>>>>>>> slightly slower, but wouldn’t fail.
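>>>>>>>> (For reference, a sketch of the corrected method, taking UTF8String;
>>>>>>>> numChars() is one way to get the length:)
>>>>>>>>
>>>>>>>>   Integer produceResult(UTF8String str) {
>>>>>>>>     return str.numChars();
>>>>>>>>   }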
>>>>>>>>
>>>>>>>> There are similar situations with varargs where it’s better to call
>>>>>>>> methods that produce concrete types than to cast from Object to
>>>>>>>> some expected type.
>>>>>>>>
>>>>>>>> I think that using invoke is a great extension to the proposal, but
>>>>>>>> I don’t think that it should be the only way to call functions. By all
>>>>>>>> means, let’s work on it in parallel and use it where possible. But I think
>>>>>>>> we do need the fallback of using InternalRow and that it isn’t a
>>>>>>>> usability problem to include it.
>>>>>>>>
>>>>>>>> Oh, and one last thought is that we already have users that call
>>>>>>>> Dataset.map and use InternalRow. This would allow converting that code
>>>>>>>> directly to a UDF.
>>>>>>>>
>>>>>>>> I think we’re closer to agreeing here than it actually looks.
>>>>>>>> Hopefully you’ll agree that having the InternalRow method isn’t a
>>>>>>>> big usability problem.
>>>>>>>>
>>>>>>>> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I don't see any objections to the rest of the proposal (loading
>>>>>>>>> functions from the catalog, function binding stuff, etc.) and I assume
>>>>>>>>> everyone is OK with it. We can commit that part first.
>>>>>>>>>
>>>>>>>>> Currently, the discussion focuses on the `ScalarFunction` API,
>>>>>>>>> where I think it's better to directly take the input columns as the UDF
>>>>>>>>> parameter, instead of wrapping the input columns with InternalRow
>>>>>>>>> and taking the InternalRow as the UDF parameter. It's not only
>>>>>>>>> for better performance, but also for ease of use. For example, it's easier
>>>>>>>>> for the UDF developer to write `input1 + input2` than `inputRow.getLong(0)
>>>>>>>>> + inputRow.getLong(1)`, as they don't need to specify the
>>>>>>>>> type and index themselves (getLong(0)), which is error-prone.
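>>>>>>>>> To make the contrast concrete (a sketch; the exact signatures are
>>>>>>>>> illustrative):
>>>>>>>>>
>>>>>>>>> // row-parameter style: index and type are the developer's responsibility
>>>>>>>>> Long produceResult(InternalRow input) {
>>>>>>>>>   return input.getLong(0) + input.getLong(1);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> // individual-parameters style: the types are part of the method signature
>>>>>>>>> Long produceResult(Long input1, Long input2) {
>>>>>>>>>   return input1 + input2;
>>>>>>>>> }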
>>>>>>>>>
>>>>>>>>> It does push more work to the Spark side, but I think it's worth
>>>>>>>>> it if implementing UDFs gets easier. I don't think the work is very
>>>>>>>>> challenging, as we can leverage the infra we built for the expression
>>>>>>>>> encoder.
>>>>>>>>>
>>>>>>>>> I think it's also important to look at the UDF API from the user's
>>>>>>>>> perspective (UDF developers). How do you like the UDF API without
>>>>>>>>> considering how Spark can support it? Do you prefer the
>>>>>>>>> individual-parameters version or the row-parameter version?
>>>>>>>>>
>>>>>>>>> To move forward, how about we implement the function loading and
>>>>>>>>> binding first? Then we can have PRs for both the individual-parameters (I
>>>>>>>>> can take it) and row-parameter approaches, if we still can't reach a
>>>>>>>>> consensus at that time and need to see all the details.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue
>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Hyukjin. I think that's a fair summary. And I agree with
>>>>>>>>>> the idea that we should focus on what Spark will do by default.
>>>>>>>>>>
>>>>>>>>>> I think we should focus on the proposal, for two reasons: first,
>>>>>>>>>> there is a straightforward path to incorporate Wenchen's suggestion via
>>>>>>>>>> `SupportsInvoke`, and second, the proposal is more complete: it defines a
>>>>>>>>>> solution for many concerns like loading a function and finding out what
>>>>>>>>>> types to use -- not just how to call code -- and supports more use cases
>>>>>>>>>> like varargs functions. I think we can continue to discuss the rest of the
>>>>>>>>>> proposal and be confident that we can support an invoke code path where it
>>>>>>>>>> makes sense.
>>>>>>>>>>
>>>>>>>>>> Does everyone agree? If not, I think we would need to solve a lot
>>>>>>>>>> of the challenges that I initially brought up with the invoke idea. It
>>>>>>>>>> seems like a good way to call a function, but needs a real proposal behind
>>>>>>>>>> it if we don't use it via `SupportsInvoke` in the current proposal.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon <
>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to make sure we don’t move past it, I think we haven’t decided
>>>>>>>>>>> yet:
>>>>>>>>>>>
>>>>>>>>>>>    - if we’ll replace the current proposal with Wenchen’s
>>>>>>>>>>>    approach as the default
>>>>>>>>>>>    - if we want to have Wenchen’s approach as an optional
>>>>>>>>>>>    mix-in on top of Ryan’s proposal (SupportsInvoke)
>>>>>>>>>>>
>>>>>>>>>>> From what I read, some people suggested it as a replacement.
>>>>>>>>>>> Please correct me if I misread this discussion thread.
>>>>>>>>>>> As Dongjoon pointed out, it would be good to know a rough ETA to
>>>>>>>>>>> make sure we are making progress on this, and so people can compare more easily.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> FWIW, there’s the saying I like in the zen of Python
>>>>>>>>>>> <https://www.python.org/dev/peps/pep-0020/>:
>>>>>>>>>>>
>>>>>>>>>>> There should be one— and preferably only one —obvious way to do
>>>>>>>>>>> it.
>>>>>>>>>>>
>>>>>>>>>>> If multiple approaches give developers a way to do the
>>>>>>>>>>> (almost) same thing, I would prefer to avoid that.
>>>>>>>>>>>
>>>>>>>>>>> In addition, I would prefer to focus on what Spark does by
>>>>>>>>>>> default first.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Wenchen.
>>>>>>>>>>>>
>>>>>>>>>>>> This thread seems to be getting enough attention. Also, I'm expecting
>>>>>>>>>>>> more and more once we have this on the `master` branch, because we are
>>>>>>>>>>>> developing it together.
>>>>>>>>>>>>
>>>>>>>>>>>>     > Spark SQL has many active contributors/committers and
>>>>>>>>>>>> this thread doesn't get much attention yet.
>>>>>>>>>>>>
>>>>>>>>>>>> So, what's your ETA from now?
>>>>>>>>>>>>
>>>>>>>>>>>>     > I think the problem here is we were discussing some very
>>>>>>>>>>>> detailed things without actual code.
>>>>>>>>>>>>     > I'll implement my idea after the holiday and then we can
>>>>>>>>>>>> have more effective discussions.
>>>>>>>>>>>>     > We can also do benchmarks and get some real numbers.
>>>>>>>>>>>>     > In the meantime, we can continue to discuss other parts
>>>>>>>>>>>> of this proposal, and make a prototype if possible.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm looking forward to seeing your PR. I hope we can conclude
>>>>>>>>>>>> this thread and have at least one implementation in the `master` branch
>>>>>>>>>>>> this month (February).
>>>>>>>>>>>> If you need more time (one month or longer), why don't we have
>>>>>>>>>>>> Ryan's suggestion in the `master` branch first and benchmark it against your PR
>>>>>>>>>>>> later during the Apache Spark 3.2 timeframe?
>>>>>>>>>>>>
>>>>>>>>>>>> Bests,
>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue
>>>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The proposal already includes an API for aggregate functions
>>>>>>>>>>>>> and I think we would want to implement those right away.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Processing ColumnBatch is something we can easily extend the
>>>>>>>>>>>>> interfaces to support, similar to Wenchen's suggestion. The important thing
>>>>>>>>>>>>> right now is to agree on some basic functionality: how to look up functions
>>>>>>>>>>>>> and what the simple API should be. Like the TableCatalog interfaces, we
>>>>>>>>>>>>> will layer on more support through optional interfaces like
>>>>>>>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <
>>>>>>>>>>>>> andrew.melo@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This proposal looks very interesting. Would future goals for
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> functionality include both support for aggregation functions,
>>>>>>>>>>>>>> as well
>>>>>>>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>>>>>>>> Row/InternalRow)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks for the positive feedback, everyone. It sounds like
>>>>>>>>>>>>>> there is a clear path forward for calling functions. Even without a
>>>>>>>>>>>>>> prototype, the `invoke` plans show that Wenchen's suggested optimization
>>>>>>>>>>>>>> can be done, and incorporating it as an optional extension to this proposal
>>>>>>>>>>>>>> solves many of the unknowns.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > With that area now understood, is there any discussion
>>>>>>>>>>>>>> about other parts of the proposal, besides the function call interface?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <
>>>>>>>>>>>>>> sunchao@apache.org> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> This is an important feature which can unblock several
>>>>>>>>>>>>>> other projects including bucket join support for DataSource v2, complete
>>>>>>>>>>>>>> support for enforcing DataSource v2 distribution requirements on the write
>>>>>>>>>>>>>> path, etc. I like Ryan's proposals which look simple and elegant, with nice
>>>>>>>>>>>>>> support on function overloading and variadic arguments. On the other hand,
>>>>>>>>>>>>>> I think Wenchen made a very good point about performance. Overall, I'm
>>>>>>>>>>>>>> excited to see active discussions on this topic and believe the community
>>>>>>>>>>>>>> will come to a proposal with the best of both sides.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Chao
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <
>>>>>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> +1 for Liang-chi's.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh <viirya@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Basically I think the proposal makes sense to me and I'd
>>>>>>>>>>>>>> like to support the
>>>>>>>>>>>>>> >>>> SPIP, as it looks like we have a strong need for this
>>>>>>>>>>>>>> >>>> important feature.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Thanks Ryan for working on this and I do also look
>>>>>>>>>>>>>> forward to Wenchen's
>>>>>>>>>>>>>> >>>> implementation. Thanks for the discussion too.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Actually I think the SupportsInvoke proposed by Ryan
>>>>>>>>>>>>>> >>>> looks like a good
>>>>>>>>>>>>>> >>>> alternative to me. Besides Wenchen's alternative
>>>>>>>>>>>>>> implementation, is there a
>>>>>>>>>>>>>> >>>> chance we could also have the SupportsInvoke one for comparison?
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> John Zhuge wrote
>>>>>>>>>>>>>> >>>> > Excited to see our Spark community rallying behind
>>>>>>>>>>>>>> this important feature!
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> > The proposal lays a solid foundation of minimal
>>>>>>>>>>>>>> feature set with careful
>>>>>>>>>>>>>> >>>> > considerations for future optimizations and
>>>>>>>>>>>>>> extensions. Can't wait to see
>>>>>>>>>>>>>> >>>> > it leading to more advanced functionalities like views
>>>>>>>>>>>>>> with shared custom
>>>>>>>>>>>>>> >>>> > functions, function pushdown, lambda, etc. It has
>>>>>>>>>>>>>> already borne fruit from
>>>>>>>>>>>>>> >>>> > the constructive collaborations in this thread.
>>>>>>>>>>>>>> Looking forward to
>>>>>>>>>>>>>> >>>> > Wenchen's prototype and further discussions including
>>>>>>>>>>>>>> the SupportsInvoke
>>>>>>>>>>>>>> >>>> > extension proposed by Ryan.
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@...> wrote:
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> >> I think this proposal is a very good thing giving
>>>>>>>>>>>>>> Spark a standard way of
>>>>>>>>>>>>>> >>>> >> getting to and calling UDFs.
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> I like having the ScalarFunction as the API to call
>>>>>>>>>>>>>> the UDFs. It is
>>>>>>>>>>>>>> >>>> >> simple, yet covers all of the polymorphic type cases
>>>>>>>>>>>>>> well. I think it
>>>>>>>>>>>>>> >>>> >> would
>>>>>>>>>>>>>> >>>> >> also simplify using the functions in other contexts
>>>>>>>>>>>>>> like pushing down
>>>>>>>>>>>>>> >>>> >> filters into the ORC & Parquet readers although there
>>>>>>>>>>>>>> are a lot of
>>>>>>>>>>>>>> >>>> >> details
>>>>>>>>>>>>>> >>>> >> that would need to be considered there.
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> .. Owen
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@...> wrote:
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >>> I agree that there is a strong need for a
>>>>>>>>>>>>>> FunctionCatalog within Spark
>>>>>>>>>>>>>> >>>> >>> to
>>>>>>>>>>>>>> >>>> >>> provide support for shareable UDFs, as well as make
>>>>>>>>>>>>>> movement towards
>>>>>>>>>>>>>> >>>> >>> more
>>>>>>>>>>>>>> >>>> >>> advanced functionality like views which themselves
>>>>>>>>>>>>>> depend on UDFs, so I
>>>>>>>>>>>>>> >>>> >>> support this SPIP wholeheartedly.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I find both of the proposed UDF APIs to be
>>>>>>>>>>>>>> sufficiently user-friendly
>>>>>>>>>>>>>> >>>> >>> and
>>>>>>>>>>>>>> >>>> >>> extensible. I generally think Wenchen's proposal is
>>>>>>>>>>>>>> easier for a user to
>>>>>>>>>>>>>> >>>> >>> work with in the common case, but has greater
>>>>>>>>>>>>>> potential for confusing
>>>>>>>>>>>>>> >>>> >>> and
>>>>>>>>>>>>>> >>>> >>> hard-to-debug behavior due to use of reflective
>>>>>>>>>>>>>> method signature
>>>>>>>>>>>>>> >>>> >>> searches.
>>>>>>>>>>>>>> >>>> >>> The merits on both sides can hopefully be more
>>>>>>>>>>>>>> properly examined with
>>>>>>>>>>>>>> >>>> >>> code,
>>>>>>>>>>>>>> >>>> >>> so I look forward to seeing an implementation of
>>>>>>>>>>>>>> Wenchen's ideas to
>>>>>>>>>>>>>> >>>> >>> provide
>>>>>>>>>>>>>> >>>> >>> a more concrete comparison. I am optimistic that we
>>>>>>>>>>>>>> will not let the
>>>>>>>>>>>>>> >>>> >>> debate
>>>>>>>>>>>>>> >>>> >>> over this point unreasonably stall the SPIP from
>>>>>>>>>>>>>> making progress.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Thank you to both Wenchen and Ryan for your detailed
>>>>>>>>>>>>>> consideration and
>>>>>>>>>>>>>> >>>> >>> evaluation of these ideas!
>>>>>>>>>>>>>> >>>> >>> ------------------------------
>>>>>>>>>>>>>> >>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@...>
>>>>>>>>>>>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>>>>>>>>>>>> >>>> >>> *To:* Ryan Blue <blue@...>
>>>>>>>>>>>>>> >>>> >>> *Cc:* Holden Karau <holden@...>; Hyukjin Kwon <gurwls223@...>; Spark Dev List <dev@...>; Wenchen Fan <cloud0fan@...>
>>>>>>>>>>>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> BTW, I forgot to add my opinion explicitly in this
>>>>>>>>>>>>>> thread because I was
>>>>>>>>>>>>>> >>>> >>> on the PR before this thread.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May 9,
>>>>>>>>>>>>>> 2019 and has been
>>>>>>>>>>>>>> >>>> >>> there for almost two years.
>>>>>>>>>>>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday
>>>>>>>>>>>>>> because I agreed with
>>>>>>>>>>>>>> >>>> >>> the latest updated design docs and AS-IS PR.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> And, the rest of the progress in this thread is also
>>>>>>>>>>>>>> very satisfying to
>>>>>>>>>>>>>> >>>> >>> me.
>>>>>>>>>>>>>> >>>> >>> (e.g. Ryan's extension suggestion and Wenchen's
>>>>>>>>>>>>>> alternative)
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> To All:
>>>>>>>>>>>>>> >>>> >>> Please take a look at the design doc and the PR, and
>>>>>>>>>>>>>> give us some
>>>>>>>>>>>>>> >>>> >>> opinions.
>>>>>>>>>>>>>> >>>> >>> We really need your participation in order to make
>>>>>>>>>>>>>> DSv2 more complete.
>>>>>>>>>>>>>> >>>> >>> This will unblock other DSv2 features, too.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi, Ryan.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> We didn't move past anything (both yours and
>>>>>>>>>>>>>> Wenchen's). What Wenchen
>>>>>>>>>>>>>> >>>> >>> suggested is double-checking the alternatives with
>>>>>>>>>>>>>> the implementation to
>>>>>>>>>>>>>> >>>> >>> give more momentum to our discussion.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Your new suggestion about an optional extension also
>>>>>>>>>>>>>> sounds like a new
>>>>>>>>>>>>>> >>>> >>> reasonable alternative to me.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> We are still discussing this topic together, and I hope we can
>>>>>>>>>>>>>> >>>> >>> reach a conclusion this time (for Apache Spark 3.2) without
>>>>>>>>>>>>>> >>>> >>> getting stuck like last time.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I really appreciate your leadership in this discussion, and the
>>>>>>>>>>>>>> >>>> >>> direction this discussion is moving looks constructive to me.
>>>>>>>>>>>>>> >>>> >>> Let's give the alternatives some time.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <blue@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I don’t think we should so quickly move past the
>>>>>>>>>>>>>> drawbacks of this
>>>>>>>>>>>>>> >>>> >>> approach. The problems are significant enough that
>>>>>>>>>>>>>> using invoke is not
>>>>>>>>>>>>>> >>>> >>> sufficient on its own. But, I think we can add it as
>>>>>>>>>>>>>> an optional
>>>>>>>>>>>>>> >>>> >>> extension
>>>>>>>>>>>>>> >>>> >>> to shore up the weaknesses.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Here’s a summary of the drawbacks:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>    - Magic function signatures are error-prone
>>>>>>>>>>>>>> >>>> >>>    - Spark would need considerable code to help
>>>>>>>>>>>>>> users find what went
>>>>>>>>>>>>>> >>>> >>>    wrong
>>>>>>>>>>>>>> >>>> >>>    - Spark would likely need to coerce arguments
>>>>>>>>>>>>>> (e.g., String,
>>>>>>>>>>>>>> >>>> >>>    Option[Int]) for usability
>>>>>>>>>>>>>> >>>> >>>    - It is unclear how Spark will find the Java
>>>>>>>>>>>>>> Method to call
>>>>>>>>>>>>>> >>>> >>>    - Use cases that require varargs fall back to
>>>>>>>>>>>>>> casting; users will
>>>>>>>>>>>>>> >>>> >>>    also get this wrong (cast to String instead of
>>>>>>>>>>>>>> UTF8String)
>>>>>>>>>>>>>> >>>> >>>    - The non-codegen path is significantly slower
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> The benefit of invoke is to avoid moving data into a
>>>>>>>>>>>>>> row, like this:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> -- using invoke
>>>>>>>>>>>>>> >>>> >>> int result = udfFunction(x, y)
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> -- using row
>>>>>>>>>>>>>> >>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>>>>>>>>>>>> >>>> >>> udfRow.update(1, y);
>>>>>>>>>>>>>> >>>> >>> int result = udfFunction(udfRow);
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> And, again, that won’t actually help much in cases
>>>>>>>>>>>>>> that require varargs.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I suggest we add a new marker trait for BoundMethod
>>>>>>>>>>>>>> called
>>>>>>>>>>>>>> >>>> >>> SupportsInvoke.
>>>>>>>>>>>>>> >>>> >>> If that interface is implemented, then Spark will
>>>>>>>>>>>>>> look for a method that
>>>>>>>>>>>>>> >>>> >>> matches the expected signature based on the bound
>>>>>>>>>>>>>> input type. If it
>>>>>>>>>>>>>> >>>> >>> isn’t
>>>>>>>>>>>>>> >>>> >>> found, Spark can print a warning and fall back to
>>>>>>>>>>>>>> the InternalRow call:
>>>>>>>>>>>>>> >>>> >>> “Cannot find udfFunction(int, int)”.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> This approach allows the invoke optimization, but
>>>>>>>>>>>>>> solves many of the
>>>>>>>>>>>>>> >>>> >>> problems:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>    - The method to invoke is found using the
>>>>>>>>>>>>>> proposed load and bind
>>>>>>>>>>>>>> >>>> >>>    approach
>>>>>>>>>>>>>> >>>> >>>    - Magic function signatures are optional and do
>>>>>>>>>>>>>> not cause runtime
>>>>>>>>>>>>>> >>>> >>>    failures
>>>>>>>>>>>>>> >>>> >>>    - Because this is an optional optimization, Spark
>>>>>>>>>>>>>> can be more strict
>>>>>>>>>>>>>> >>>> >>>    about types
>>>>>>>>>>>>>> >>>> >>>    - Varargs cases can still use rows
>>>>>>>>>>>>>> >>>> >>>    - Non-codegen can use an evaluation method rather
>>>>>>>>>>>>>> than falling back
>>>>>>>>>>>>>> >>>> >>>    to slow Java reflection
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> This seems like a good extension to me; this
>>>>>>>>>>>>>> provides a plan for
>>>>>>>>>>>>>> >>>> >>> optimizing the UDF call to avoid building a row,
>>>>>>>>>>>>>> while the existing
>>>>>>>>>>>>>> >>>> >>> proposal covers the other cases well and addresses
>>>>>>>>>>>>>> how to locate these
>>>>>>>>>>>>>> >>>> >>> function calls.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> This also highlights that the approach used in DSv2
>>>>>>>>>>>>>> and this proposal is
>>>>>>>>>>>>>> >>>> >>> working: start small and use extensions to layer on
>>>>>>>>>>>>>> more complex
>>>>>>>>>>>>>> >>>> >>> support.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Thank you all for making a giant move forward for
>>>>>>>>>>>>>> Apache Spark 3.2.0.
>>>>>>>>>>>>>> >>>> >>> I'm really looking forward to seeing Wenchen's
>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>> >>>> >>> That would be greatly helpful to make a decision!
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> > I'll implement my idea after the holiday and then
>>>>>>>>>>>>>> we can have
>>>>>>>>>>>>>> >>>> >>> more effective discussions. We can also do
>>>>>>>>>>>>>> benchmarks and get some real
>>>>>>>>>>>>>> >>>> >>> numbers.
>>>>>>>>>>>>>> >>>> >>> > FYI: the Presto UDF API
>>>>>>>>>>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html> also
>>>>>>>>>>>>>> >>>> >>> takes individual parameters instead of the row
>>>>>>>>>>>>>> parameter. I think this
>>>>>>>>>>>>>> >>>> >>> direction at least worth a try so that we can see
>>>>>>>>>>>>>> the performance
>>>>>>>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc as
>>>>>>>>>>>>>> an alternative
>>>>>>>>>>>>>> >>>> >>> (Trino).
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> FYI: the Presto UDF API
>>>>>>>>>>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html>
>>>>>>>>>>>>>> >>>> >>> also takes individual parameters instead of the row
>>>>>>>>>>>>>> parameter. I think
>>>>>>>>>>>>>> >>>> >>> this
>>>>>>>>>>>>>> >>>> >>> direction at least worth a try so that we can see
>>>>>>>>>>>>>> the performance
>>>>>>>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc as
>>>>>>>>>>>>>> an alternative
>>>>>>>>>>>>>> >>>> >>> (Trino).
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi Holden,
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> As Hyukjin said, following existing designs is not
>>>>>>>>>>>>>> the principle of DS
>>>>>>>>>>>>>> >>>> >>> v2
>>>>>>>>>>>>>> >>>> >>> API design. We should make sure the DS v2 API makes
>>>>>>>>>>>>>> sense. AFAIK we
>>>>>>>>>>>>>> >>>> >>> didn't
>>>>>>>>>>>>>> >>>> >>> fully follow the catalog API design from Hive and I
>>>>>>>>>>>>>> believe Ryan also
>>>>>>>>>>>>>> >>>> >>> agrees with it.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I think the problem here is we were discussing some
>>>>>>>>>>>>>> very detailed things
>>>>>>>>>>>>>> >>>> >>> without actual code. I'll implement my idea after
>>>>>>>>>>>>>> the holiday and then
>>>>>>>>>>>>>> >>>> >>> we
>>>>>>>>>>>>>> >>>> >>> can have more effective discussions. We can also do
>>>>>>>>>>>>>> benchmarks and get
>>>>>>>>>>>>>> >>>> >>> some
>>>>>>>>>>>>>> >>>> >>> real numbers.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> In the meantime, we can continue to discuss other
>>>>>>>>>>>>>> parts of this
>>>>>>>>>>>>>> >>>> >>> proposal,
>>>>>>>>>>>>>> >>>> >>> and make a prototype if possible. Spark SQL has many
>>>>>>>>>>>>>> active
>>>>>>>>>>>>>> >>>> >>> contributors/committers and this thread doesn't get
>>>>>>>>>>>>>> much attention yet.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Just dropping a few lines. I remember that one of
>>>>>>>>>>>>>> the goals in DSv2 is
>>>>>>>>>>>>>> >>>> >>> to
>>>>>>>>>>>>>> >>>> >>> correct the mistakes we made in the current Spark code.
>>>>>>>>>>>>>> >>>> >>> There would not be much point if we just follow and mimic what
>>>>>>>>>>>>>> >>>> >>> Spark currently does. It might just end up as another copy of
>>>>>>>>>>>>>> >>>> >>> Spark's APIs, e.g. the Expression (internal) APIs. I sincerely
>>>>>>>>>>>>>> >>>> >>> would like to avoid this.
>>>>>>>>>>>>>> >>>> >>> I do believe we have been stuck mainly due to trying to come up
>>>>>>>>>>>>>> >>>> >>> with a better design. We already have the ugly picture of the
>>>>>>>>>>>>>> >>>> >>> current Spark APIs from which to draw a better, bigger picture.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau <holden@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I think this proposal is a good set of trade-offs
>>>>>>>>>>>>>> and has existed in the
>>>>>>>>>>>>>> >>>> >>> community for a long period of time. I especially
>>>>>>>>>>>>>> appreciate how the
>>>>>>>>>>>>>> >>>> >>> design
>>>>>>>>>>>>>> >>>> >>> is focused on a minimal useful component, with
>>>>>>>>>>>>>> future optimizations
>>>>>>>>>>>>>> >>>> >>> considered from a point of view of making sure it's
>>>>>>>>>>>>>> flexible, but actual
>>>>>>>>>>>>>> >>>> >>> concrete decisions left for the future once we see
>>>>>>>>>>>>>> how this API is used.
>>>>>>>>>>>>>> >>>> >>> I
>>>>>>>>>>>>>> >>>> >>> think if we try and optimize everything right out of
>>>>>>>>>>>>>> the gate, we'll
>>>>>>>>>>>>>> >>>> >>> quickly get stuck (again) and not make any progress.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi everyone,
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I'd like to start a discussion for adding a
>>>>>>>>>>>>>> FunctionCatalog interface to
>>>>>>>>>>>>>> >>>> >>> catalog plugins. This will allow catalogs to expose
>>>>>>>>>>>>>> functions to Spark,
>>>>>>>>>>>>>> >>>> >>> similar to how the TableCatalog interface allows a
>>>>>>>>>>>>>> catalog to expose
>>>>>>>>>>>>>> >>>> >>> tables. The proposal doc is available here:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Here's a high-level summary of some of the main
>>>>>>>>>>>>>> design choices:
>>>>>>>>>>>>>> >>>> >>> * Adds the ability to list and load functions, not
>>>>>>>>>>>>>> to create or modify
>>>>>>>>>>>>>> >>>> >>> them in an external catalog
>>>>>>>>>>>>>> >>>> >>> * Supports scalar, aggregate, and partial aggregate
>>>>>>>>>>>>>> functions
>>>>>>>>>>>>>> >>>> >>> * Uses load and bind steps for better error messages
>>>>>>>>>>>>>> and simpler
>>>>>>>>>>>>>> >>>> >>> implementations
>>>>>>>>>>>>>> >>>> >>> * Like the DSv2 table read and write APIs, it uses
>>>>>>>>>>>>>> InternalRow to pass
>>>>>>>>>>>>>> >>>> >>> data
>>>>>>>>>>>>>> >>>> >>> * Can be extended using mix-in interfaces to add
>>>>>>>>>>>>>> vectorization, codegen,
>>>>>>>>>>>>>> >>>> >>> and other future features
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> There is also a PR with the proposed API:
>>>>>>>>>>>>>> >>>> >>> https://github.com/apache/spark/pull/24559/files
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Let's discuss the proposal here rather than on that
>>>>>>>>>>>>>> PR, to get better
>>>>>>>>>>>>>> >>>> >>> visibility. Also, please take the time to read the
>>>>>>>>>>>>>> proposal first. That
>>>>>>>>>>>>>> >>>> >>> really helps clear up misconceptions.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>> >>>> >>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>> >>>> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>>>>>> >>>> >>> https://amzn.to/2MaRAG9
>>>>>>>>>>>>>> >>>> >>> YouTube Live Streams:
>>>>>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> --
>>>>>>>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> > --
>>>>>>>>>>>>>> >>>> > John Zhuge
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> --
>>>>>>>>>>>>>> >>>> Sent from:
>>>>>>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > --
>>>>>>>>>>>>>> > Ryan Blue
>>>>>>>>>>>>>> > Software Engineer
>>>>>>>>>>>>>> > Netflix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
