spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Fri, 19 Feb 2021 07:31:00 GMT
If people have such a big concern about reflection, we can follow the current
Spark Java UDF
<https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java>
and Transport
<https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf>,
and create ScalarFuncion0[R], ScalarFuncion1[T1, R], etc. to avoid
reflection. But we may need to investigate how to avoid boxing with this
API design.

To put a detailed proposal: let's have ScalarFuncion0, ScalarFuncion1, ...,
ScalarFuncion9 and VarargsScalarFunction. At execution time, if Spark sees
ScalarFuncion0-9, pass the input columns to the UDF directly, one column
one parameter. So string type input is UTF8String, array type input is
ArrayData. If Spark sees VarargsScalarFunction, wrap the input columns with
InternalRow and pass it to the UDF.

In general, if VarargsScalarFunction is implemented, the UDF should not
implement ScalarFuncion0-9. We can also define a priority order to allow
this. I don't have a strong preference here.

What do you think?

On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa <wa.moustafa@gmail.com>
wrote:

> I agree with Ryan on the questions around the expressivity of the Invoke
> method. It is not clear to me how the Invoke method can be used to declare
> UDFs with type-parameterized parameters. For example: a UDF to get the Nth
> element of an array (regardless of the Array element type) or a UDF to
> merge two Arrays (of generic types) to a Map.
>
> Also, to address Wenchen's InternalRow question, can we create a number of
> Function classes, each corresponding to a number of input parameter length
> (e.g., ScalarFunction1, ScalarFunction2, etc)?
>
> Thanks,
> Walaa.
>
>
> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
>> I agree with you that it is better in many cases to directly call a
>> method. But it it not better in all cases, which is why I don’t think it is
>> the right general-purpose choice.
>>
>> First, if codegen isn’t used for some reason, the reflection overhead is
>> really significant. That gets much better when you have an interface to
>> call. That’s one reason I’d use this pattern:
>>
>> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>   Double produceResult(InternalRow row) {
>>     return produceResult(row.getDouble(0), row.getDouble(1));
>>   }
>>
>>   double produceResult(double left, double right) {
>>     return left + right;
>>   }
>> }
>>
>> There’s little overhead to adding the InternalRow variation, but we
>> could call it in eval to avoid the reflect overhead. To the point about
>> UDF developers, I think this is a reasonable cost.
>>
>> Second, I think usability is better and helps avoid runtime issues.
>> Here’s an example:
>>
>> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>   Integer produceResult(InternalRow row) {
>>     return produceResult(row.getString(0));
>>   }
>>
>>   Integer produceResult(String str) {
>>     return str.length();
>>   }
>> }
>>
>> See the bug? I forgot to use UTF8String instead of String. With the
>> InternalRow method, I get a compiler warning because getString produces
>> UTF8String that can’t be passed to produceResult(String). If I decided
>> to implement length separately, then we could still run the InternalRow
>> version and log a warning. The code would be slightly slower, but wouldn’t
>> fail.
>>
>> There are similar situations with varargs where it’s better to call
>> methods that produce concrete types than to cast from Object to some
>> expected type.
>>
>> I think that using invoke is a great extension to the proposal, but I
>> don’t think that it should be the only way to call functions. By all means,
>> let’s work on it in parallel and use it where possible. But I think we do
>> need the fallback of using InternalRow and that it isn’t a usability
>> problem to include it.
>>
>> Oh, and one last thought is that we already have users that call
>> Dataset.map and use InternalRow. This would allow converting that code
>> directly to a UDF.
>>
>> I think we’re closer to agreeing here than it actually looks. Hopefully
>> you’ll agree that having the InternalRow method isn’t a big usability
>> problem.
>>
>> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> I don't see any objections to the rest of the proposal (loading
>>> functions from the catalog, function binding stuff, etc.) and I assume
>>> everyone is OK with it. We can commit that part first.
>>>
>>> Currently, the discussion focuses on the `ScalarFunction` API, where I
>>> think it's better to directly take the input columns as the UDF parameter,
>>> instead of wrapping the input columns with InternalRow and taking the
>>> InternalRow as the UDF parameter. It's not only for better performance,
>>> but also for ease of use. For example, it's easier for the UDF developer to
>>> write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
>>> as they don't need to specify the type and index by themselves (
>>> getLong(0)) which is error-prone.
>>>
>>> It does push more work to the Spark side, but I think it's worth it if
>>> implementing UDF gets easier. I don't think the work is very challenging,
>>> as we can leverage the infra we built for the expression encoder.
>>>
>>> I think it's also important to look at the UDF API from the user's
>>> perspective (UDF developers). How do you like the UDF API without
>>> considering how Spark can support it? Do you prefer the
>>> individual-parameters version or the row-parameter version?
>>>
>>> To move forward, how about we implement the function loading and binding
>>> first? Then we can have PRs for both the individual-parameters (I can take
>>> it) and row-parameter approaches, if we still can't reach a consensus at
>>> that time and need to see all the details.
>>>
>>> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue <rblue@netflix.com.invalid>
>>> wrote:
>>>
>>>> Thanks, Hyukjin. I think that's a fair summary. And I agree with the
>>>> idea that we should focus on what Spark will do by default.
>>>>
>>>> I think we should focus on the proposal, for two reasons: first, there
>>>> is a straightforward path to incorporate Wenchen's suggestion via
>>>> `SupportsInvoke`, and second, the proposal is more complete: it defines a
>>>> solution for many concerns like loading a function and finding out what
>>>> types to use -- not just how to call code -- and supports more use cases
>>>> like varargs functions. I think we can continue to discuss the rest of the
>>>> proposal and be confident that we can support an invoke code path where it
>>>> makes sense.
>>>>
>>>> Does everyone agree? If not, I think we would need to solve a lot of
>>>> the challenges that I initially brought up with the invoke idea. It seems
>>>> like a good way to call a function, but needs a real proposal behind it if
>>>> we don't use it via `SupportsInvoke` in the current proposal.
>>>>
>>>> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon <gurwls223@gmail.com>
>>>> wrote:
>>>>
>>>>> Just to make sure we don’t move past, I think we haven’t decided yet:
>>>>>
>>>>>    - if we’ll replace the current proposal to Wenchen’s approach as
>>>>>    the default
>>>>>    - if we want to have Wenchen’s approach as an optional mix-in on
>>>>>    the top of Ryan’s proposal (SupportsInvoke)
>>>>>
>>>>> From what I read, some people pointed out it as a replacement. Please
>>>>> correct me if I misread this discussion thread.
>>>>> As Dongjoon pointed out, it would be good to know rough ETA to make
>>>>> sure making progress in this, and people can compare more easily.
>>>>>
>>>>>
>>>>> FWIW, there’s the saying I like in the zen of Python
>>>>> <https://www.python.org/dev/peps/pep-0020/>:
>>>>>
>>>>> There should be one— and preferably only one —obvious way to do it.
>>>>>
>>>>> If multiple approaches have the way for developers to do the (almost)
>>>>> same thing, I would prefer to avoid it.
>>>>>
>>>>> In addition, I would prefer to focus on what Spark does by default
>>>>> first.
>>>>>
>>>>>
>>>>> 2021년 2월 17일 (수) 오후 2:33, Dongjoon Hyun <dongjoon.hyun@gmail.com>님이
>>>>> 작성:
>>>>>
>>>>>> Hi, Wenchen.
>>>>>>
>>>>>> This thread seems to get enough attention. Also, I'm expecting more
>>>>>> and more if we have this on the `master` branch because we are developing
>>>>>> together.
>>>>>>
>>>>>>     > Spark SQL has many active contributors/committers and this
>>>>>> thread doesn't get much attention yet.
>>>>>>
>>>>>> So, what's your ETA from now?
>>>>>>
>>>>>>     > I think the problem here is we were discussing some very
>>>>>> detailed things without actual code.
>>>>>>     > I'll implement my idea after the holiday and then we can have
>>>>>> more effective discussions.
>>>>>>     > We can also do benchmarks and get some real numbers.
>>>>>>     > In the meantime, we can continue to discuss other parts of this
>>>>>> proposal, and make a prototype if possible.
>>>>>>
>>>>>> I'm looking forward to seeing your PR. I hope we can conclude this
>>>>>> thread and have at least one implementation in the `master` branch this
>>>>>> month (February).
>>>>>> If you need more time (one month or longer), why don't we have Ryan's
>>>>>> suggestion in the `master` branch first and benchmark with your PR later
>>>>>> during Apache Spark 3.2 timeframe.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue <rblue@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Andrew,
>>>>>>>
>>>>>>> The proposal already includes an API for aggregate functions and I
>>>>>>> think we would want to implement those right away.
>>>>>>>
>>>>>>> Processing ColumnBatch is something we can easily extend the
>>>>>>> interfaces to support, similar to Wenchen's suggestion. The important thing
>>>>>>> right now is to agree on some basic functionality: how to look up functions
>>>>>>> and what the simple API should be. Like the TableCatalog interfaces, we
>>>>>>> will layer on more support through optional interfaces like
>>>>>>> `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>
>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <andrew.melo@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Ryan,
>>>>>>>>
>>>>>>>> This proposal looks very interesting. Would future goals for this
>>>>>>>> functionality include both support for aggregation functions, as
>>>>>>>> well
>>>>>>>> as support for processing ColumnBatch-es (instead of
>>>>>>>> Row/InternalRow)?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue
>>>>>>>> <rblue@netflix.com.invalid> wrote:
>>>>>>>> >
>>>>>>>> > Thanks for the positive feedback, everyone. It sounds like there
>>>>>>>> is a clear path forward for calling functions. Even without a prototype,
>>>>>>>> the `invoke` plans show that Wenchen's suggested optimization can be done,
>>>>>>>> and incorporating it as an optional extension to this proposal solves many
>>>>>>>> of the unknowns.
>>>>>>>> >
>>>>>>>> > With that area now understood, is there any discussion about
>>>>>>>> other parts of the proposal, besides the function call interface?
>>>>>>>> >
>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <sunchao@apache.org>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >> This is an important feature which can unblock several other
>>>>>>>> projects including bucket join support for DataSource v2, complete support
>>>>>>>> for enforcing DataSource v2 distribution requirements on the write path,
>>>>>>>> etc. I like Ryan's proposals which look simple and elegant, with nice
>>>>>>>> support on function overloading and variadic arguments. On the other hand,
>>>>>>>> I think Wenchen made a very good point about performance. Overall, I'm
>>>>>>>> excited to see active discussions on this topic and believe the community
>>>>>>>> will come to a proposal with the best of both sides.
>>>>>>>> >>
>>>>>>>> >> Chao
>>>>>>>> >>
>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <
>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> +1 for Liang-chi's.
>>>>>>>> >>>
>>>>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> 2021년 2월 13일 (토) 오후 12:18, Liang-Chi Hsieh <viirya@gmail.com>님이
>>>>>>>> 작성:
>>>>>>>> >>>>
>>>>>>>> >>>> Basically I think the proposal makes sense to me and I'd like
>>>>>>>> to support the
>>>>>>>> >>>> SPIP as it looks like we have strong need for the important
>>>>>>>> feature.
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks Ryan for working on this and I do also look forward to
>>>>>>>> Wenchen's
>>>>>>>> >>>> implementation. Thanks for the discussion too.
>>>>>>>> >>>>
>>>>>>>> >>>> Actually I think the SupportsInvoke proposed by Ryan looks a
>>>>>>>> good
>>>>>>>> >>>> alternative to me. Besides Wenchen's alternative
>>>>>>>> implementation, is there a
>>>>>>>> >>>> chance we also have the SupportsInvoke for comparison?
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> John Zhuge wrote
>>>>>>>> >>>> > Excited to see our Spark community rallying behind this
>>>>>>>> important feature!
>>>>>>>> >>>> >
>>>>>>>> >>>> > The proposal lays a solid foundation of minimal feature set
>>>>>>>> with careful
>>>>>>>> >>>> > considerations for future optimizations and extensions.
>>>>>>>> Can't wait to see
>>>>>>>> >>>> > it leading to more advanced functionalities like views with
>>>>>>>> shared custom
>>>>>>>> >>>> > functions, function pushdown, lambda, etc. It has already
>>>>>>>> borne fruit from
>>>>>>>> >>>> > the constructive collaborations in this thread. Looking
>>>>>>>> forward to
>>>>>>>> >>>> > Wenchen's prototype and further discussions including the
>>>>>>>> SupportsInvoke
>>>>>>>> >>>> > extension proposed by Ryan.
>>>>>>>> >>>> >
>>>>>>>> >>>> >
>>>>>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > owen.omalley@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> > wrote:
>>>>>>>> >>>> >
>>>>>>>> >>>> >> I think this proposal is a very good thing giving Spark a
>>>>>>>> standard way of
>>>>>>>> >>>> >> getting to and calling UDFs.
>>>>>>>> >>>> >>
>>>>>>>> >>>> >> I like having the ScalarFunction as the API to call the
>>>>>>>> UDFs. It is
>>>>>>>> >>>> >> simple, yet covers all of the polymorphic type cases well.
>>>>>>>> I think it
>>>>>>>> >>>> >> would
>>>>>>>> >>>> >> also simplify using the functions in other contexts like
>>>>>>>> pushing down
>>>>>>>> >>>> >> filters into the ORC & Parquet readers although there are a
>>>>>>>> lot of
>>>>>>>> >>>> >> details
>>>>>>>> >>>> >> that would need to be considered there.
>>>>>>>> >>>> >>
>>>>>>>> >>>> >> .. Owen
>>>>>>>> >>>> >>
>>>>>>>> >>>> >>
>>>>>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > ekrogen@.com
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >> wrote:
>>>>>>>> >>>> >>
>>>>>>>> >>>> >>> I agree that there is a strong need for a FunctionCatalog
>>>>>>>> within Spark
>>>>>>>> >>>> >>> to
>>>>>>>> >>>> >>> provide support for shareable UDFs, as well as make
>>>>>>>> movement towards
>>>>>>>> >>>> >>> more
>>>>>>>> >>>> >>> advanced functionality like views which themselves depend
>>>>>>>> on UDFs, so I
>>>>>>>> >>>> >>> support this SPIP wholeheartedly.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I find both of the proposed UDF APIs to be sufficiently
>>>>>>>> user-friendly
>>>>>>>> >>>> >>> and
>>>>>>>> >>>> >>> extensible. I generally think Wenchen's proposal is easier
>>>>>>>> for a user to
>>>>>>>> >>>> >>> work with in the common case, but has greater potential
>>>>>>>> for confusing
>>>>>>>> >>>> >>> and
>>>>>>>> >>>> >>> hard-to-debug behavior due to use of reflective method
>>>>>>>> signature
>>>>>>>> >>>> >>> searches.
>>>>>>>> >>>> >>> The merits on both sides can hopefully be more properly
>>>>>>>> examined with
>>>>>>>> >>>> >>> code,
>>>>>>>> >>>> >>> so I look forward to seeing an implementation of Wenchen's
>>>>>>>> ideas to
>>>>>>>> >>>> >>> provide
>>>>>>>> >>>> >>> a more concrete comparison. I am optimistic that we will
>>>>>>>> not let the
>>>>>>>> >>>> >>> debate
>>>>>>>> >>>> >>> over this point unreasonably stall the SPIP from making
>>>>>>>> progress.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Thank you to both Wenchen and Ryan for your detailed
>>>>>>>> consideration and
>>>>>>>> >>>> >>> evaluation of these ideas!
>>>>>>>> >>>> >>> ------------------------------
>>>>>>>> >>>> >>> *From:* Dongjoon Hyun &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > dongjoon.hyun@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>>>>>> >>>> >>> *To:* Ryan Blue &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > blue@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >>> *Cc:* Holden Karau &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > holden@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;; Hyukjin Kwon <
>>>>>>>> >>>> >>>
>>>>>>>> >>>>
>>>>>>>> >>>> > gurwls223@
>>>>>>>> >>>>
>>>>>>>> >>>> >>; Spark Dev List &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > dev@.apache
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;; Wenchen Fan
>>>>>>>> >>>> >>> &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > cloud0fan@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> BTW, I forgot to add my opinion explicitly in this thread
>>>>>>>> because I was
>>>>>>>> >>>> >>> on the PR before this thread.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019
>>>>>>>> and has been
>>>>>>>> >>>> >>> there for almost two years.
>>>>>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday because I
>>>>>>>> agreed with
>>>>>>>> >>>> >>> the latest updated design docs and AS-IS PR.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> And, the rest of the progress in this thread is also very
>>>>>>>> satisfying to
>>>>>>>> >>>> >>> me.
>>>>>>>> >>>> >>> (e.g. Ryan's extension suggestion and Wenchen's
>>>>>>>> alternative)
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> To All:
>>>>>>>> >>>> >>> Please take a look at the design doc and the PR, and give
>>>>>>>> us some
>>>>>>>> >>>> >>> opinions.
>>>>>>>> >>>> >>> We really need your participation in order to make DSv2
>>>>>>>> more complete.
>>>>>>>> >>>> >>> This will unblock other DSv2 features, too.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Bests,
>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > dongjoon.hyun@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >>> wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Hi, Ryan.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> We didn't move past anything (both yours and Wenchen's).
>>>>>>>> What Wenchen
>>>>>>>> >>>> >>> suggested is double-checking the alternatives with the
>>>>>>>> implementation to
>>>>>>>> >>>> >>> give more momentum to our discussion.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Your new suggestion about optional extention also sounds
>>>>>>>> like a new
>>>>>>>> >>>> >>> reasonable alternative to me.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> We are still discussing this topic together and I hope we
>>>>>>>> can make a
>>>>>>>> >>>> >>> conclude at this time (for Apache Spark 3.2) without being
>>>>>>>> stucked like
>>>>>>>> >>>> >>> last time.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I really appreciate your leadership in this dicsussion and
>>>>>>>> the moving
>>>>>>>> >>>> >>> direction of this discussion looks constructive to me.
>>>>>>>> Let's give some
>>>>>>>> >>>> >>> time
>>>>>>>> >>>> >>> to the alternatives.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Bests,
>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > blue@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt; wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I don’t think we should so quickly move past the drawbacks
>>>>>>>> of this
>>>>>>>> >>>> >>> approach. The problems are significant enough that using
>>>>>>>> invoke is not
>>>>>>>> >>>> >>> sufficient on its own. But, I think we can add it as an
>>>>>>>> optional
>>>>>>>> >>>> >>> extension
>>>>>>>> >>>> >>> to shore up the weaknesses.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Here’s a summary of the drawbacks:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>    - Magic function signatures are error-prone
>>>>>>>> >>>> >>>    - Spark would need considerable code to help users find
>>>>>>>> what went
>>>>>>>> >>>> >>>    wrong
>>>>>>>> >>>> >>>    - Spark would likely need to coerce arguments (e.g.,
>>>>>>>> String,
>>>>>>>> >>>> >>>    Option[Int]) for usability
>>>>>>>> >>>> >>>    - It is unclear how Spark will find the Java Method to
>>>>>>>> call
>>>>>>>> >>>> >>>    - Use cases that require varargs fall back to casting;
>>>>>>>> users will
>>>>>>>> >>>> >>>    also get this wrong (cast to String instead of
>>>>>>>> UTF8String)
>>>>>>>> >>>> >>>    - The non-codegen path is significantly slower
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> The benefit of invoke is to avoid moving data into a row,
>>>>>>>> like this:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> -- using invoke
>>>>>>>> >>>> >>> int result = udfFunction(x, y)
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> -- using row
>>>>>>>> >>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>>>>>> >>>> >>> udfRow.update(1, y);
>>>>>>>> >>>> >>> int result = udfFunction(udfRow);
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> And, again, that won’t actually help much in cases that
>>>>>>>> require varargs.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I suggest we add a new marker trait for BoundMethod called
>>>>>>>> >>>> >>> SupportsInvoke.
>>>>>>>> >>>> >>> If that interface is implemented, then Spark will look for
>>>>>>>> a method that
>>>>>>>> >>>> >>> matches the expected signature based on the bound input
>>>>>>>> type. If it
>>>>>>>> >>>> >>> isn’t
>>>>>>>> >>>> >>> found, Spark can print a warning and fall back to the
>>>>>>>> InternalRow call:
>>>>>>>> >>>> >>> “Cannot find udfFunction(int, int)”.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> This approach allows the invoke optimization, but solves
>>>>>>>> many of the
>>>>>>>> >>>> >>> problems:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>    - The method to invoke is found using the proposed load
>>>>>>>> and bind
>>>>>>>> >>>> >>>    approach
>>>>>>>> >>>> >>>    - Magic function signatures are optional and do not
>>>>>>>> cause runtime
>>>>>>>> >>>> >>>    failures
>>>>>>>> >>>> >>>    - Because this is an optional optimization, Spark can
>>>>>>>> be more strict
>>>>>>>> >>>> >>>    about types
>>>>>>>> >>>> >>>    - Varargs cases can still use rows
>>>>>>>> >>>> >>>    - Non-codegen can use an evaluation method rather than
>>>>>>>> falling back
>>>>>>>> >>>> >>>    to slow Java reflection
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> This seems like a good extension to me; this provides a
>>>>>>>> plan for
>>>>>>>> >>>> >>> optimizing the UDF call to avoid building a row, while the
>>>>>>>> existing
>>>>>>>> >>>> >>> proposal covers the other cases well and addresses how to
>>>>>>>> locate these
>>>>>>>> >>>> >>> function calls.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> This also highlights that the approach used in DSv2 and
>>>>>>>> this proposal is
>>>>>>>> >>>> >>> working: start small and use extensions to layer on more
>>>>>>>> complex
>>>>>>>> >>>> >>> support.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > dongjoon.hyun@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;
>>>>>>>> >>>> >>> wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Thank you all for making a giant move forward for Apache
>>>>>>>> Spark 3.2.0.
>>>>>>>> >>>> >>> I'm really looking forward to seeing Wenchen's
>>>>>>>> implementation.
>>>>>>>> >>>> >>> That would be greatly helpful to make a decision!
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> > I'll implement my idea after the holiday and then we can
>>>>>>>> have
>>>>>>>> >>>> >>> more effective discussions. We can also do benchmarks and
>>>>>>>> get some real
>>>>>>>> >>>> >>> numbers.
>>>>>>>> >>>> >>> > FYI: the Presto UDF API
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fprestodb.io%2Fdocs%2Fcurrent%2Fdevelop%2Ffunctions.html&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067978066%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=iMWmHqqXPcT7EK%2Bovyzhy%2BZpU6Llih%2BwdZD53wvobmc%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>> also
>>>>>>>> >>>> >>> takes individual parameters instead of the row parameter.
>>>>>>>> I think this
>>>>>>>> >>>> >>> direction at least worth a try so that we can see the
>>>>>>>> performance
>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc as an
>>>>>>>> alternative
>>>>>>>> >>>> >>> (Trino).
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Bests,
>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > cloud0fan@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt; wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> FYI: the Presto UDF API
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fprestodb.io%2Fdocs%2Fcurrent%2Fdevelop%2Ffunctions.html&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZSBCR7yx3PpwL4KY9V73JG42Z02ZodqkjxC0LweHt1g%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>> also takes individual parameters instead of the row
>>>>>>>> parameter. I think
>>>>>>>> >>>> >>> this
>>>>>>>> >>>> >>> direction at least worth a try so that we can see the
>>>>>>>> performance
>>>>>>>> >>>> >>> difference. It's also mentioned in the design doc as an
>>>>>>>> alternative
>>>>>>>> >>>> >>> (Trino).
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > cloud0fan@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt; wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Hi Holden,
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> As Hyukjin said, following existing designs is not the
>>>>>>>> principle of DS
>>>>>>>> >>>> >>> v2
>>>>>>>> >>>> >>> API design. We should make sure the DS v2 API makes sense.
>>>>>>>> AFAIK we
>>>>>>>> >>>> >>> didn't
>>>>>>>> >>>> >>> fully follow the catalog API design from Hive and I
>>>>>>>> believe Ryan also
>>>>>>>> >>>> >>> agrees with it.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I think the problem here is we were discussing some very
>>>>>>>> detailed things
>>>>>>>> >>>> >>> without actual code. I'll implement my idea after the
>>>>>>>> holiday and then
>>>>>>>> >>>> >>> we
>>>>>>>> >>>> >>> can have more effective discussions. We can also do
>>>>>>>> benchmarks and get
>>>>>>>> >>>> >>> some
>>>>>>>> >>>> >>> real numbers.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> In the meantime, we can continue to discuss other parts of
>>>>>>>> this
>>>>>>>> >>>> >>> proposal,
>>>>>>>> >>>> >>> and make a prototype if possible. Spark SQL has many active
>>>>>>>> >>>> >>> contributors/committers and this thread doesn't get much
>>>>>>>> attention yet.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > gurwls223@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt; wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Just dropping a few lines. I remember that one of the
>>>>>>>> goals in DSv2 is
>>>>>>>> >>>> >>> to
>>>>>>>> >>>> >>> correct the mistakes we made in the current Spark codes.
>>>>>>>> >>>> >>> It would not have much point if we will happen to just
>>>>>>>> follow and mimic
>>>>>>>> >>>> >>> what Spark currently does. It might just end up with
>>>>>>>> another copy of
>>>>>>>> >>>> >>> Spark
>>>>>>>> >>>> >>> APIs, e.g. Expression (internal) APIs. I sincerely would
>>>>>>>> like to avoid
>>>>>>>> >>>> >>> this
>>>>>>>> >>>> >>> I do believe we have been stuck mainly due to trying to
>>>>>>>> come up with a
>>>>>>>> >>>> >>> better design. We already have an ugly picture of the
>>>>>>>> current Spark APIs
>>>>>>>> >>>> >>> to
>>>>>>>> >>>> >>> draw a better bigger picture.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> 2021년 2월 10일 (수) 오전 3:28, Holden Karau &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > holden@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt;님이 작성:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I think this proposal is a good set of trade-offs and has
>>>>>>>> existed in the
>>>>>>>> >>>> >>> community for a long period of time. I especially
>>>>>>>> appreciate how the
>>>>>>>> >>>> >>> design
>>>>>>>> >>>> >>> is focused on a minimal useful component, with future
>>>>>>>> optimizations
>>>>>>>> >>>> >>> considered from a point of view of making sure it's
>>>>>>>> flexible, but actual
>>>>>>>> >>>> >>> concrete decisions left for the future once we see how
>>>>>>>> this API is used.
>>>>>>>> >>>> >>> I
>>>>>>>> >>>> >>> think if we try and optimize everything right out of the
>>>>>>>> gate, we'll
>>>>>>>> >>>> >>> quickly get stuck (again) and not make any progress.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue &lt;
>>>>>>>> >>>>
>>>>>>>> >>>> > blue@
>>>>>>>> >>>>
>>>>>>>> >>>> > &gt; wrote:
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Hi everyone,
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> I'd like to start a discussion for adding a
>>>>>>>> FunctionCatalog interface to
>>>>>>>> >>>> >>> catalog plugins. This will allow catalogs to expose
>>>>>>>> functions to Spark,
>>>>>>>> >>>> >>> similar to how the TableCatalog interface allows a catalog
>>>>>>>> to expose
>>>>>>>> >>>> >>> tables. The proposal doc is available here:
>>>>>>>> >>>> >>>
>>>>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U%2Fedit&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Kyth8%2FhNUZ6GXG2FsgcknZ7t7s0%2BpxnDMPyxvsxLLqE%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Here's a high-level summary of some of the main design
>>>>>>>> choices:
>>>>>>>> >>>> >>> * Adds the ability to list and load functions, not to
>>>>>>>> create or modify
>>>>>>>> >>>> >>> them in an external catalog
>>>>>>>> >>>> >>> * Supports scalar, aggregate, and partial aggregate
>>>>>>>> functions
>>>>>>>> >>>> >>> * Uses load and bind steps for better error messages and
>>>>>>>> simpler
>>>>>>>> >>>> >>> implementations
>>>>>>>> >>>> >>> * Like the DSv2 table read and write APIs, it uses
>>>>>>>> InternalRow to pass
>>>>>>>> >>>> >>> data
>>>>>>>> >>>> >>> * Can be extended using mix-in interfaces to add
>>>>>>>> vectorization, codegen,
>>>>>>>> >>>> >>> and other future features
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> There is also a PR with the proposed API:
>>>>>>>> >>>> >>> https://github.com/apache/spark/pull/24559/files
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F24559%2Ffiles&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067988024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=t3ZCqffdsrmCY3X%2FT8x1oMjMcNUiQ0wQNk%2ByAXQx1Io%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> Let's discuss the proposal here rather than on that PR, to
>>>>>>>> get better
>>>>>>>> >>>> >>> visibility. Also, please take the time to read the
>>>>>>>> proposal first. That
>>>>>>>> >>>> >>> really helps clear up misconceptions.
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> --
>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> --
>>>>>>>> >>>> >>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fholdenkarau&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067997978%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fVfSPIyazuUYv8VLfNu%2BUIHdc3ePM1AAKKH%2BlnIicF8%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> >>>> >>> https://amzn.to/2MaRAG9
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Famzn.to%2F2MaRAG9&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060067997978%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=NbRl9kK%2B6Wy0jWmDnztYp3JCPNLuJvmFsLHUrXzEhlk%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>> YouTube Live Streams:
>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>> >>>> >>> &lt;
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.youtube.com%2Fuser%2Fholdenkarau&amp;data=04%7C01%7Cekrogen%40linkedin.com%7C0ccf8c15abd74dfc974f08d8ce31ae4d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637486060068007935%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=OWXOBELzO3hBa2JI%2FOSBZ3oNyLq0yr%2FGXMkNn7bqYDM%3D&amp;reserved=0&gt
>>>>>>>> ;
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>> --
>>>>>>>> >>>> >>> Ryan Blue
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >>>
>>>>>>>> >>>> >
>>>>>>>> >>>> > --
>>>>>>>> >>>> > John Zhuge
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> --
>>>>>>>> >>>> Sent from:
>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>> >>>>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Ryan Blue
>>>>>>>> > Software Engineer
>>>>>>>> > Netflix
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

Mime
View raw message