spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Wed, 17 Feb 2021 10:48:46 GMT
I did a simple benchmark (adding two long values) to compare the
performance between
1. native expression
2. the current UDF
3. new UDF with individual parameters
4. new UDF with a row parameter (with the row object cached)
5. invoke a static method (to explore the possibility of speeding up
stateless UDF, not very related to the current topic)

The benchmark code can be found here
<https://gist.github.com/cloud-fan/f88baf770fa0c6f9ad312e8c92ff6c21>. The
result is

Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
UDF perf:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
native add                            14206          14516         535        70.4          14.2       1.0X
udf add                               24609          25271         898        40.6          24.6       0.6X
new udf add                           18657          19096         726        53.6          18.7       0.8X
new row udf add                       21128          22343        1478        47.3          21.1       0.7X
static udf add                        16678          16887         278        60.0          16.7       0.9X


The new UDF with individual parameters is faster than the current UDF
because the virtual function call is eliminated. It's also faster than the
row-parameter version because there is no overhead from setting and getting
row fields.
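
The difference between variants 3 and 4 can be sketched with a small,
self-contained toy. The `Row` class below is a hypothetical stand-in for
InternalRow, not the actual Spark API; it only illustrates the extra
set/get (and boxing) work the row-parameter shape pays per call.

```java
// Hypothetical stand-ins for illustration only; not the actual Spark API.
public class UdfShapes {
    // A minimal stand-in for InternalRow: just a boxed field array.
    static final class Row {
        private final Object[] values;
        Row(int size) { values = new Object[size]; }
        void update(int i, Object v) { values[i] = v; }
        long getLong(int i) { return (Long) values[i]; }
    }

    // Variant 3: individual parameters; arguments are passed directly.
    static long addIndividual(long x, long y) {
        return x + y;
    }

    // Variant 4: row parameter; arguments must first be set into a row,
    // then read back out inside the UDF.
    static long addRow(Row row) {
        return row.getLong(0) + row.getLong(1);
    }

    public static void main(String[] args) {
        long a = 1L, b = 2L;

        // Individual-parameters call: no intermediate container.
        long r1 = addIndividual(a, b);

        // Row-parameter call: extra update/get work on every invocation
        // (plus boxing, in this naive stand-in).
        Row row = new Row(2);
        row.update(0, a);
        row.update(1, b);
        long r2 = addRow(row);

        System.out.println(r1 + " " + r2); // prints "3 3"
    }
}
```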

I prefer the individual-parameters version, not only because of the
performance gain (10% is not a big win), but also because:
1. It's consistent with the current Scala/Java UDF API.
2. It's simpler for developers to write simple UDFs (the parameters are the
input columns directly).
3. It makes it possible to allow multiple Java types for one Catalyst type,
e.g. allowing both String and UTF8String, which is more flexible.

One major issue is the lack of varargs support, but I'm not sure how
important this feature is. As I mentioned before, users can work around it
by accepting struct-type input and using the `struct` function to build the
input column. The current Scala/Java UDF doesn't support varargs either,
nor do Presto/Transport.

I'm fine with having an optional trait or flag to support varargs by
accepting InternalRow as the input, if there are user requests.
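
To illustrate the workaround's shape: a variable number of columns is packed
into one struct column, and the UDF sees a single struct-typed argument. The
`Struct` class below is a hypothetical stand-in; in Spark the caller would
build the column with the `struct` function and the UDF would receive an
InternalRow-like value.

```java
// A hedged sketch of the struct-based varargs workaround; not Spark code.
public class VarargsWorkaround {
    // Stand-in for a struct value delivered as a single row-like argument.
    static final class Struct {
        private final Object[] fields;
        Struct(Object... fields) { this.fields = fields; }
        int numFields() { return fields.length; }
        long getLong(int i) { return (Long) fields[i]; }
    }

    // A "varargs" UDF written against a single struct-typed input:
    // it sums however many long fields the struct carries.
    static long sumAll(Struct s) {
        long total = 0;
        for (int i = 0; i < s.numFields(); i++) {
            total += s.getLong(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // In Spark, callers would build the struct column with `struct`;
        // here we just construct the stand-in directly.
        System.out.println(sumAll(new Struct(1L, 2L, 3L))); // prints 6
    }
}
```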

About debugging, I don't see a big issue here, as the process of calling the
new UDF is very similar to the current Scala/Java UDF. Please let me know
if there are existing complaints about debugging the current Scala/Java
UDF. I think the row-parameter version is even harder to debug, as the
column binding happens in user code (e.g. row.getLong(index)), which is
entirely a runtime concern, while the individual-parameters version has a
query-compile-time check that the function signature matches the input
columns.
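
A rough sketch of what such a query-compile-time check could look like. The
class and method names here (`AddUdf`, `produceResult`, `bind`) are
hypothetical illustrations, not the proposed API: the point is that a
signature mismatch fails once at analysis time instead of per row.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

public class SignatureCheck {
    // A hypothetical UDF class with a "magic" method.
    static class AddUdf {
        public long produceResult(long x, long y) { return x + y; }
    }

    // Hypothetical analysis-time check: look up a method matching the
    // bound input column types, so mismatches fail before execution.
    static Method bind(Class<?> udf, Class<?>... inputTypes) {
        try {
            return udf.getMethod("produceResult", inputTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                "Cannot find produceResult" + Arrays.toString(inputTypes), e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Matches the declared signature: binding succeeds once, up front.
        Method m = bind(AddUdf.class, long.class, long.class);
        System.out.println(m.invoke(new AddUdf(), 1L, 2L)); // prints 3

        // bind(AddUdf.class, String.class) would fail here, at analysis
        // time, rather than while processing rows.
    }
}
```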

I can help to come up with detailed rules about null handling, type
matching, etc. for the individual-parameters UDF, if we all agree with this
direction.

Last but not least, calling methods via reflection (searching for the method
handle only needs to be done once per task) is not that slow on modern
JVMs. The non-codegen path is already about 10x slower, so a small overhead
from Java reflection hardly matters.
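
For reference, the JDK's method-handle API supports exactly this
lookup-once pattern: resolve the handle a single time (once per task, in
our case), then invoke it repeatedly at near-direct-call cost. This is
generic JDK code, not Spark internals.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class HandleLookup {
    public static void main(String[] args) throws Throwable {
        // Resolve the handle once...
        MethodHandle add = MethodHandles.lookup().findStatic(
                Math.class, "addExact",
                MethodType.methodType(long.class, long.class, long.class));

        // ...then invoke it many times. invokeExact requires the exact
        // static types declared in the MethodType above.
        long sum = 0;
        for (long i = 0; i < 5; i++) {
            sum = (long) add.invokeExact(sum, i);
        }
        System.out.println(sum); // prints 10
    }
}
```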



On Wed, Feb 17, 2021 at 3:07 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Just to make sure we don’t move past, I think we haven’t decided yet:
>
>    - if we’ll replace the current proposal with Wenchen’s approach as the
>    default
>    - if we want to have Wenchen’s approach as an optional mix-in on top
>    of Ryan’s proposal (SupportsInvoke)
>
> From what I read, some people suggested it as a replacement. Please
> correct me if I misread this discussion thread.
> As Dongjoon pointed out, it would be good to know a rough ETA, so we can
> make sure we're making progress and people can compare more easily.
>
>
> FWIW, there’s a saying I like in the Zen of Python
> <https://www.python.org/dev/peps/pep-0020/>:
>
> There should be one— and preferably only one —obvious way to do it.
>
> If multiple approaches give developers ways to do (almost) the same
> thing, I would prefer to avoid that.
>
> In addition, I would prefer to focus on what Spark does by default first.
>
>
> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>
>> Hi, Wenchen.
>>
>> This thread seems to get enough attention. Also, I'm expecting more and
>> more attention once we have this on the `master` branch, because we are
>> developing together.
>>
>>     > Spark SQL has many active contributors/committers and this thread
>> doesn't get much attention yet.
>>
>> So, what's your ETA from now?
>>
>>     > I think the problem here is we were discussing some very detailed
>> things without actual code.
>>     > I'll implement my idea after the holiday and then we can have more
>> effective discussions.
>>     > We can also do benchmarks and get some real numbers.
>>     > In the meantime, we can continue to discuss other parts of this
>> proposal, and make a prototype if possible.
>>
>> I'm looking forward to seeing your PR. I hope we can conclude this thread
>> and have at least one implementation in the `master` branch this month
>> (February).
>> If you need more time (one month or longer), why don't we have Ryan's
>> suggestion in the `master` branch first and benchmark it against your PR
>> later during the Apache Spark 3.2 timeframe?
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue <rblue@netflix.com.invalid>
>> wrote:
>>
>>> Andrew,
>>>
>>> The proposal already includes an API for aggregate functions and I think
>>> we would want to implement those right away.
>>>
>>> Processing ColumnBatch is something we can easily extend the interfaces
>>> to support, similar to Wenchen's suggestion. The important thing right now
>>> is to agree on some basic functionality: how to look up functions and what
>>> the simple API should be. Like the TableCatalog interfaces, we will layer
>>> on more support through optional interfaces like `SupportsInvoke` or
>>> `SupportsColumnBatch`.
>>>
>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <andrew.melo@gmail.com>
>>> wrote:
>>>
>>>> Hello Ryan,
>>>>
>>>> This proposal looks very interesting. Would future goals for this
>>>> functionality include both support for aggregation functions and
>>>> support for processing ColumnBatch-es (instead of Row/InternalRow)?
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue <rblue@netflix.com.invalid>
>>>> wrote:
>>>> >
>>>> > Thanks for the positive feedback, everyone. It sounds like there is a
>>>> > clear path forward for calling functions. Even without a prototype,
>>>> > the `invoke` plans show that Wenchen's suggested optimization can be
>>>> > done, and incorporating it as an optional extension to this proposal
>>>> > solves many of the unknowns.
>>>> >
>>>> > With that area now understood, is there any discussion about other
>>>> > parts of the proposal, besides the function call interface?
>>>> >
>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <sunchao@apache.org> wrote:
>>>> >>
>>>> >> This is an important feature which can unblock several other
>>>> >> projects, including bucket join support for DataSource v2, complete
>>>> >> support for enforcing DataSource v2 distribution requirements on the
>>>> >> write path, etc. I like Ryan's proposals, which look simple and
>>>> >> elegant, with nice support for function overloading and variadic
>>>> >> arguments. On the other hand, I think Wenchen made a very good point
>>>> >> about performance. Overall, I'm excited to see active discussions on
>>>> >> this topic and believe the community will come to a proposal with the
>>>> >> best of both sides.
>>>> >>
>>>> >> Chao
>>>> >>
>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <gurwls223@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> +1 for Liang-chi's.
>>>> >>>
>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>> >>>
>>>> >>>
>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh <viirya@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Basically I think the proposal makes sense to me, and I'd like to
>>>> >>>> support the SPIP, as it looks like we have a strong need for this
>>>> >>>> important feature.
>>>> >>>>
>>>> >>>> Thanks Ryan for working on this, and I also look forward to
>>>> >>>> Wenchen's implementation. Thanks for the discussion too.
>>>> >>>>
>>>> >>>> Actually, I think the SupportsInvoke proposed by Ryan looks like a
>>>> >>>> good alternative to me. Besides Wenchen's alternative
>>>> >>>> implementation, is there a chance we could also have SupportsInvoke
>>>> >>>> for comparison?
>>>> >>>>
>>>> >>>>
>>>> >>>> John Zhuge wrote:
>>>> >>>> > Excited to see our Spark community rallying behind this important
>>>> >>>> > feature!
>>>> >>>> >
>>>> >>>> > The proposal lays a solid foundation of a minimal feature set
>>>> >>>> > with careful considerations for future optimizations and
>>>> >>>> > extensions. Can't wait to see it leading to more advanced
>>>> >>>> > functionality like views with shared custom functions, function
>>>> >>>> > pushdown, lambdas, etc. It has already borne fruit from the
>>>> >>>> > constructive collaboration in this thread. Looking forward to
>>>> >>>> > Wenchen's prototype and further discussions, including the
>>>> >>>> > SupportsInvoke extension proposed by Ryan.
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@...>
>>>> >>>> > wrote:
>>>> >>>> >
>>>> >>>> >> I think this proposal is a very good thing, giving Spark a
>>>> >>>> >> standard way of getting to and calling UDFs.
>>>> >>>> >>
>>>> >>>> >> I like having ScalarFunction as the API to call the UDFs. It is
>>>> >>>> >> simple, yet covers all of the polymorphic type cases well. I
>>>> >>>> >> think it would also simplify using the functions in other
>>>> >>>> >> contexts, like pushing down filters into the ORC & Parquet
>>>> >>>> >> readers, although there are a lot of details that would need to
>>>> >>>> >> be considered there.
>>>> >>>> >>
>>>> >>>> >> .. Owen
>>>> >>>> >>
>>>> >>>> >>
>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@...>
>>>> >>>> >> wrote:
>>>> >>>> >>
>>>> >>>> >>> I agree that there is a strong need for a FunctionCatalog
>>>> >>>> >>> within Spark to provide support for shareable UDFs, as well as
>>>> >>>> >>> to move towards more advanced functionality like views which
>>>> >>>> >>> themselves depend on UDFs, so I support this SPIP
>>>> >>>> >>> wholeheartedly.
>>>> >>>> >>>
>>>> >>>> >>> I find both of the proposed UDF APIs to be sufficiently
>>>> >>>> >>> user-friendly and extensible. I generally think Wenchen's
>>>> >>>> >>> proposal is easier for a user to work with in the common case,
>>>> >>>> >>> but has greater potential for confusing and hard-to-debug
>>>> >>>> >>> behavior due to its use of reflective method signature
>>>> >>>> >>> searches. The merits on both sides can hopefully be more
>>>> >>>> >>> properly examined with code, so I look forward to seeing an
>>>> >>>> >>> implementation of Wenchen's ideas to provide a more concrete
>>>> >>>> >>> comparison. I am optimistic that we will not let the debate
>>>> >>>> >>> over this point unreasonably stall the SPIP from making
>>>> >>>> >>> progress.
>>>> >>>> >>>
>>>> >>>> >>> Thank you to both Wenchen and Ryan for your detailed
>>>> >>>> >>> consideration and evaluation of these ideas!
>>>> >>>> >>> ------------------------------
>>>> >>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@...>
>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>> >>>> >>> *To:* Ryan Blue <blue@...>
>>>> >>>> >>> *Cc:* Holden Karau <holden@...>; Hyukjin Kwon <gurwls223@...>;
>>>> >>>> >>> Spark Dev List <dev@...>; Wenchen Fan <cloud0fan@...>
>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>> >>>> >>>
>>>> >>>> >>> BTW, I forgot to add my opinion explicitly in this thread
>>>> >>>> >>> because I was on the PR before this thread.
>>>> >>>> >>>
>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has
>>>> >>>> >>> been there for almost two years.
>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday because I
>>>> >>>> >>> agreed with the latest updated design docs and AS-IS PR.
>>>> >>>> >>>
>>>> >>>> >>> And, the rest of the progress in this thread is also very
>>>> >>>> >>> satisfying to me (e.g. Ryan's extension suggestion and
>>>> >>>> >>> Wenchen's alternative).
>>>> >>>> >>>
>>>> >>>> >>> To All:
>>>> >>>> >>> Please take a look at the design doc and the PR, and give us
>>>> >>>> >>> some opinions.
>>>> >>>> >>> We really need your participation in order to make DSv2 more
>>>> >>>> >>> complete. This will unblock other DSv2 features, too.
>>>> >>>> >>>
>>>> >>>> >>> Bests,
>>>> >>>> >>> Dongjoon.
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun
>>>> >>>> >>> <dongjoon.hyun@...> wrote:
>>>> >>>> >>>
>>>> >>>> >>> Hi, Ryan.
>>>> >>>> >>>
>>>> >>>> >>> We didn't move past anything (both yours and Wenchen's). What
>>>> >>>> >>> Wenchen suggested is double-checking the alternatives with the
>>>> >>>> >>> implementation to give more momentum to our discussion.
>>>> >>>> >>>
>>>> >>>> >>> Your new suggestion about an optional extension also sounds
>>>> >>>> >>> like a reasonable new alternative to me.
>>>> >>>> >>>
>>>> >>>> >>> We are still discussing this topic together, and I hope we can
>>>> >>>> >>> reach a conclusion this time (for Apache Spark 3.2) without
>>>> >>>> >>> getting stuck like last time.
>>>> >>>> >>>
>>>> >>>> >>> I really appreciate your leadership in this discussion, and the
>>>> >>>> >>> direction it is moving in looks constructive to me. Let's give
>>>> >>>> >>> some time to the alternatives.
>>>> >>>> >>>
>>>> >>>> >>> Bests,
>>>> >>>> >>> Dongjoon.
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <blue@...> wrote:
>>>> >>>> >>>
>>>> >>>> >>> I don’t think we should so quickly move past the drawbacks of
>>>> >>>> >>> this approach. The problems are significant enough that using
>>>> >>>> >>> invoke is not sufficient on its own. But, I think we can add it
>>>> >>>> >>> as an optional extension to shore up the weaknesses.
>>>> >>>> >>>
>>>> >>>> >>> Here’s a summary of the drawbacks:
>>>> >>>> >>>
>>>> >>>> >>>    - Magic function signatures are error-prone
>>>> >>>> >>>    - Spark would need considerable code to help users find what
>>>> >>>> >>>    went wrong
>>>> >>>> >>>    - Spark would likely need to coerce arguments (e.g., String,
>>>> >>>> >>>    Option[Int]) for usability
>>>> >>>> >>>    - It is unclear how Spark will find the Java Method to call
>>>> >>>> >>>    - Use cases that require varargs fall back to casting; users
>>>> >>>> >>>    will also get this wrong (cast to String instead of
>>>> >>>> >>>    UTF8String)
>>>> >>>> >>>    - The non-codegen path is significantly slower
>>>> >>>> >>>
>>>> >>>> >>> The benefit of invoke is to avoid moving data into a row, like
>>>> >>>> >>> this:
>>>> >>>> >>>
>>>> >>>> >>> -- using invoke
>>>> >>>> >>> int result = udfFunction(x, y)
>>>> >>>> >>>
>>>> >>>> >>> -- using row
>>>> >>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>> >>>> >>> udfRow.update(1, y);
>>>> >>>> >>> int result = udfFunction(udfRow);
>>>> >>>> >>>
>>>> >>>> >>> And, again, that won’t actually help much in cases that
>>>> >>>> >>> require varargs.
>>>> >>>> >>>
>>>> >>>> >>> I suggest we add a new marker trait for BoundMethod called
>>>> >>>> >>> SupportsInvoke. If that interface is implemented, then Spark
>>>> >>>> >>> will look for a method that matches the expected signature
>>>> >>>> >>> based on the bound input type. If it isn’t found, Spark can
>>>> >>>> >>> print a warning and fall back to the InternalRow call: “Cannot
>>>> >>>> >>> find udfFunction(int, int)”.
>>>> >>>> >>>
>>>> >>>> >>> This approach allows the invoke optimization, but solves many
>>>> >>>> >>> of the problems:
>>>> >>>> >>>
>>>> >>>> >>>    - The method to invoke is found using the proposed load and
>>>> >>>> >>>    bind approach
>>>> >>>> >>>    - Magic function signatures are optional and do not cause
>>>> >>>> >>>    runtime failures
>>>> >>>> >>>    - Because this is an optional optimization, Spark can be
>>>> >>>> >>>    more strict about types
>>>> >>>> >>>    - Varargs cases can still use rows
>>>> >>>> >>>    - Non-codegen can use an evaluation method rather than
>>>> >>>> >>>    falling back to slow Java reflection
>>>> >>>> >>>
>>>> >>>> >>> This seems like a good extension to me; this provides a plan
>>>> >>>> >>> for optimizing the UDF call to avoid building a row, while the
>>>> >>>> >>> existing proposal covers the other cases well and addresses how
>>>> >>>> >>> to locate these function calls.
>>>> >>>> >>>
>>>> >>>> >>> This also highlights that the approach used in DSv2 and this
>>>> >>>> >>> proposal is working: start small and use extensions to layer on
>>>> >>>> >>> more complex support.
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun
>>>> >>>> >>> <dongjoon.hyun@...> wrote:
>>>> >>>> >>>
>>>> >>>> >>> Thank you all for making a giant move forward for Apache Spark
>>>> >>>> >>> 3.2.0. I'm really looking forward to seeing Wenchen's
>>>> >>>> >>> implementation. That would be greatly helpful for making a
>>>> >>>> >>> decision!
>>>> >>>> >>>
>>>> >>>> >>> > I'll implement my idea after the holiday and then we can have
>>>> >>>> >>> more effective discussions. We can also do benchmarks and get
>>>> >>>> >>> some real numbers.
>>>> >>>> >>> > FYI: the Presto UDF API
>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html> also
>>>> >>>> >>> takes individual parameters instead of the row parameter. I
>>>> >>>> >>> think this direction is at least worth a try so that we can see
>>>> >>>> >>> the performance difference. It's also mentioned in the design
>>>> >>>> >>> doc as an alternative (Trino).
>>>> >>>> >>>
>>>> >>>> >>> Bests,
>>>> >>>> >>> Dongjoon.
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@...>
>>>> >>>> >>> wrote:
>>>> >>>> >>>
>>>> >>>> >>> FYI: the Presto UDF API
>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html> also
>>>> >>>> >>> takes individual parameters instead of the row parameter. I
>>>> >>>> >>> think this direction is at least worth a try so that we can see
>>>> >>>> >>> the performance difference. It's also mentioned in the design
>>>> >>>> >>> doc as an alternative (Trino).
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@...>
>>>> >>>> >>> wrote:
>>>> >>>> >>>
>>>> >>>> >>> Hi Holden,
>>>> >>>> >>>
>>>> >>>> >>> As Hyukjin said, following existing designs is not the
>>>> >>>> >>> principle of DS v2 API design. We should make sure the DS v2
>>>> >>>> >>> API makes sense. AFAIK we didn't fully follow the catalog API
>>>> >>>> >>> design from Hive, and I believe Ryan also agrees with that.
>>>> >>>> >>>
>>>> >>>> >>> I think the problem here is that we were discussing some very
>>>> >>>> >>> detailed things without actual code. I'll implement my idea
>>>> >>>> >>> after the holiday and then we can have more effective
>>>> >>>> >>> discussions. We can also do benchmarks and get some real
>>>> >>>> >>> numbers.
>>>> >>>> >>>
>>>> >>>> >>> In the meantime, we can continue to discuss other parts of this
>>>> >>>> >>> proposal, and make a prototype if possible. Spark SQL has many
>>>> >>>> >>> active contributors/committers, and this thread hasn't gotten
>>>> >>>> >>> much attention yet.
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@...>
>>>> >>>> >>> wrote:
>>>> >>>> >>>
>>>> >>>> >>> Just dropping a few lines. I remember that one of the goals in
>>>> >>>> >>> DSv2 is to correct the mistakes we made in the current Spark
>>>> >>>> >>> codebase. There would not be much point if we just happen to
>>>> >>>> >>> follow and mimic what Spark currently does. It might just end
>>>> >>>> >>> up as another copy of Spark APIs, e.g. the (internal)
>>>> >>>> >>> Expression APIs. I sincerely would like to avoid this.
>>>> >>>> >>>
>>>> >>>> >>> I do believe we have been stuck mainly due to trying to come up
>>>> >>>> >>> with a better design. We already have an ugly picture of the
>>>> >>>> >>> current Spark APIs from which to draw a better, bigger picture.
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <holden@...>
>>>> >>>> >>> wrote:
>>>> >>>> >>>
>>>> >>>> >>> I think this proposal is a good set of trade-offs and has
>>>> >>>> >>> existed in the community for a long period of time. I
>>>> >>>> >>> especially appreciate how the design is focused on a minimal
>>>> >>>> >>> useful component, with future optimizations considered from the
>>>> >>>> >>> point of view of making sure it's flexible, but actual concrete
>>>> >>>> >>> decisions left for the future once we see how this API is used.
>>>> >>>> >>> I think if we try to optimize everything right out of the gate,
>>>> >>>> >>> we'll quickly get stuck (again) and not make any progress.
>>>> >>>> >>>
>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@...> wrote:
>>>> >>>> >>>
>>>> >>>> >>> Hi everyone,
>>>> >>>> >>>
>>>> >>>> >>> I'd like to start a discussion for adding a FunctionCatalog
>>>> >>>> >>> interface to catalog plugins. This will allow catalogs to
>>>> >>>> >>> expose functions to Spark, similar to how the TableCatalog
>>>> >>>> >>> interface allows a catalog to expose tables. The proposal doc
>>>> >>>> >>> is available here:
>>>> >>>> >>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>> >>>> >>>
>>>> >>>> >>> Here's a high-level summary of some of the main design choices:
>>>> >>>> >>> * Adds the ability to list and load functions, not to create or
>>>> >>>> >>> modify them in an external catalog
>>>> >>>> >>> * Supports scalar, aggregate, and partial aggregate functions
>>>> >>>> >>> * Uses load and bind steps for better error messages and
>>>> >>>> >>> simpler implementations
>>>> >>>> >>> * Like the DSv2 table read and write APIs, it uses InternalRow
>>>> >>>> >>> to pass data
>>>> >>>> >>> * Can be extended using mix-in interfaces to add vectorization,
>>>> >>>> >>> codegen, and other future features
>>>> >>>> >>>
>>>> >>>> >>> There is also a PR with the proposed API:
>>>> >>>> >>> https://github.com/apache/spark/pull/24559/files
>>>> >>>> >>>
>>>> >>>> >>> Let's discuss the proposal here rather than on that PR, to get
>>>> >>>> >>> better visibility. Also, please take the time to read the
>>>> >>>> >>> proposal first. That really helps clear up misconceptions.
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>> --
>>>> >>>> >>> Ryan Blue
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >>> --
>>>> >>>> >>> Twitter: https://twitter.com/holdenkarau
>>>> >>>> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>> >>>> >>> https://amzn.to/2MaRAG9
>>>> >>>> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >>>> >>>
>>>> >>>> >>> --
>>>> >>>> >>> Ryan Blue
>>>> >>>> >>>
>>>> >>>> >>>
>>>> >>>> >
>>>> >>>> > --
>>>> >>>> > John Zhuge
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>> >>>>
>>>> >>>>
>>>> ---------------------------------------------------------------------
>>>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> >>>>
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
