spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Wed, 10 Feb 2021 17:03:46 GMT
Thank you all for making a giant move forward for Apache Spark 3.2.0.
I'm really looking forward to seeing Wenchen's implementation.
That would be greatly helpful in making a decision!

> I'll implement my idea after the holiday and then we can have
> more effective discussions. We can also do benchmarks and get some real
> numbers.
> FYI: the Presto UDF API
> <https://prestodb.io/docs/current/develop/functions.html> also
> takes individual parameters instead of the row parameter. I think this
> direction is at least worth a try so that we can see the performance
> difference. It's also mentioned in the design doc as an alternative (Trino).
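
To make the comparison between the two calling conventions concrete, here is a minimal, self-contained sketch in plain Java. The `Row` class and both function shapes are simplified stand-ins invented for illustration; they are not Spark's `InternalRow` or the API actually proposed in the PR.

```java
// Sketch of the two UDF calling conventions under discussion.
// "Row" is a simplified stand-in, NOT Spark's InternalRow.
public class UdfStyles {
    // Stand-in for a row of already-evaluated argument values.
    public static class Row {
        private final Object[] values;
        public Row(Object... values) { this.values = values; }
        public int getInt(int ordinal) { return (Integer) values[ordinal]; }
    }

    // Row-parameter style: one generic entry point; arguments must be
    // unpacked with positional, type-specific getters (and boxed on the way in).
    public interface RowScalarFunction<R> {
        R produceResult(Row input);
    }

    public static final RowScalarFunction<Integer> ADD_ROW =
        row -> row.getInt(0) + row.getInt(1);

    // Individual-parameter style (the Presto/Trino-like alternative):
    // a concrete signature per function, so no row packing or casts.
    public static int addIndividual(int left, int right) {
        return left + right;
    }
}
```

A benchmark would essentially measure the cost of the `Row` packing, getter calls, and boxing in the first style against the direct primitive call in the second.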

Bests,
Dongjoon.


On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> FYI: the Presto UDF API
> <https://prestodb.io/docs/current/develop/functions.html> also
> takes individual parameters instead of the row parameter. I think this
> direction is at least worth a try so that we can see the performance
> difference. It's also mentioned in the design doc as an alternative (Trino).
>
> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Hi Holden,
>>
>> As Hyukjin said, following existing designs is not the principle of DS v2
>> API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
>> fully follow the catalog API design from Hive and I believe Ryan also
>> agrees with it.
>>
>> I think the problem here is we were discussing some very detailed things
>> without actual code. I'll implement my idea after the holiday and then we
>> can have more effective discussions. We can also do benchmarks and get some
>> real numbers.
>>
>> In the meantime, we can continue to discuss other parts of this proposal,
>> and make a prototype if possible. Spark SQL has many active
>> contributors/committers, but this thread hasn't gotten much attention yet.
>>
>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>
>>> Just dropping a few lines. I remember that one of the goals in DSv2 is
>>> to correct the mistakes we made in the current Spark code.
>>> There would not be much point if we end up just following and mimicking
>>> what Spark currently does. It might just become another copy of the Spark
>>> APIs, e.g. the (internal) Expression APIs. I sincerely would like to avoid this.
>>> I do believe we have been stuck mainly because we are trying to come up with a
>>> better design. We already have the ugly picture of the current Spark APIs to
>>> help us draw a better, bigger picture.
>>>
>>>
>>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <holden@pigscanfly.ca> wrote:
>>>
>>>> I think this proposal is a good set of trade-offs and has existed in
>>>> the community for a long period of time. I especially appreciate how the
>>>> design is focused on a minimal useful component, with future optimizations
>>>> considered from a point of view of making sure it's flexible, but actual
>>>> concrete decisions left for the future once we see how this API is used. I
>>>> think if we try and optimize everything right out of the gate, we'll
>>>> quickly get stuck (again) and not make any progress.
>>>>
>>>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'd like to start a discussion for adding a FunctionCatalog interface
>>>>> to catalog plugins. This will allow catalogs to expose functions to Spark,
>>>>> similar to how the TableCatalog interface allows a catalog to expose
>>>>> tables. The proposal doc is available here:
>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>
>>>>> Here's a high-level summary of some of the main design choices:
>>>>> * Adds the ability to list and load functions, not to create or modify
>>>>> them in an external catalog
>>>>> * Supports scalar, aggregate, and partial aggregate functions
>>>>> * Uses load and bind steps for better error messages and simpler
>>>>> implementations
>>>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>>>> data
>>>>> * Can be extended using mix-in interfaces to add vectorization,
>>>>> codegen, and other future features
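
The load and bind steps above can be sketched as follows. This is an illustrative stand-in written for this thread: the names `loadFunction`, `UnboundFunction`, and `BoundFunction` echo the proposal, but the signatures here are simplified and hypothetical; see the PR for the actual interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the proposal's load-then-bind flow.
public class FunctionCatalogSketch {
    // A bound function: argument types have already been validated.
    public interface BoundFunction {
        int apply(int a, int b);
    }

    // An unbound function: binding checks the input types and can fail
    // with a clear, function-specific error message instead of a
    // runtime cast failure deep inside execution.
    public interface UnboundFunction {
        BoundFunction bind(Class<?>[] argTypes);
    }

    // The catalog only lists and loads functions; it does not create them.
    private final Map<String, UnboundFunction> functions = new HashMap<>();

    public void register(String name, UnboundFunction fn) {
        functions.put(name, fn);
    }

    public UnboundFunction loadFunction(String name) {
        UnboundFunction fn = functions.get(name);
        if (fn == null) {
            throw new IllegalArgumentException("Unknown function: " + name);
        }
        return fn;
    }

    // Example: integer addition that rejects wrong argument types at bind time.
    public static UnboundFunction intAdd() {
        return argTypes -> {
            if (argTypes.length != 2
                    || argTypes[0] != Integer.class
                    || argTypes[1] != Integer.class) {
                throw new IllegalArgumentException("int_add expects (int, int)");
            }
            return (a, b) -> a + b;
        };
    }
}
```

The point of the two-step flow is visible here: a bad call fails at `bind` with "int_add expects (int, int)" rather than failing mid-query, and implementations stay simple because `apply` can assume validated inputs.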
>>>>>
>>>>> There is also a PR with the proposed API:
>>>>> https://github.com/apache/spark/pull/24559/files
>>>>>
>>>>> Let's discuss the proposal here rather than on that PR, to get better
>>>>> visibility. Also, please take the time to read the proposal first. That
>>>>> really helps clear up misconceptions.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
