spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Wed, 10 Feb 2021 02:18:39 GMT
Hi Holden,

As Hyukjin said, following existing designs is not a principle of DS v2
API design. We should make sure the DS v2 API makes sense. AFAIK we didn't
fully follow the catalog API design from Hive, and I believe Ryan agrees
with that.

I think the problem here is we were discussing some very detailed things
without actual code. I'll implement my idea after the holiday and then we
can have more effective discussions. We can also do benchmarks and get some
real numbers.

In the meantime, we can continue to discuss other parts of this proposal,
and make a prototype if possible. Spark SQL has many active
contributors/committers, but this thread hasn't gotten much attention yet.

On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Just dropping a few lines. I remember that one of the goals of DSv2 is to
> correct the mistakes we made in the current Spark code.
> There would not be much point if we just follow and mimic
> what Spark currently does. It might just end up as another copy of Spark
> APIs, e.g. the (internal) Expression APIs. I sincerely would like to avoid this.
> I do believe we have been stuck mainly due to trying to come up with a
> better design. We already have an ugly picture of the current Spark APIs
> from which to draw a better, bigger picture.
>
>
> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <holden@pigscanfly.ca> wrote:
>
>> I think this proposal is a good set of trade-offs and has existed in the
>> community for a long period of time. I especially appreciate how the design
>> is focused on a minimal useful component, with future optimizations
>> considered from a point of view of making sure it's flexible, but actual
>> concrete decisions left for the future once we see how this API is used. I
>> think if we try and optimize everything right out of the gate, we'll
>> quickly get stuck (again) and not make any progress.
>>
>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to start a discussion for adding a FunctionCatalog interface to
>>> catalog plugins. This will allow catalogs to expose functions to Spark,
>>> similar to how the TableCatalog interface allows a catalog to expose
>>> tables. The proposal doc is available here:
>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>
>>> Here's a high-level summary of some of the main design choices:
>>> * Adds the ability to list and load functions, not to create or modify
>>> them in an external catalog
>>> * Supports scalar, aggregate, and partial aggregate functions
>>> * Uses load and bind steps for better error messages and simpler
>>> implementations
>>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>>> data
>>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>>> and other future features
>>>
>>> There is also a PR with the proposed API:
>>> https://github.com/apache/spark/pull/24559/files
>>>
>>> Let's discuss the proposal here rather than on that PR, to get better
>>> visibility. Also, please take the time to read the proposal first. That
>>> really helps clear up misconceptions.
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
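
For readers following the thread without the proposal doc open, the "list and load" and "load and bind" bullets in Ryan's summary can be pictured with a small toy model. This is only a sketch under stated assumptions: it is written in Python for brevity, every class and method name below (ToyFunctionCatalog, UnboundStrLen, BoundScalar, BindError, bind, produce_result) is hypothetical and not taken from the proposal doc or the PR, and a plain tuple stands in for InternalRow. Aggregate and partial aggregate functions are not covered.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class BindError(Exception):
    """Raised at bind time when a function cannot accept the given input types."""


@dataclass(frozen=True)
class BoundScalar:
    """A function resolved against concrete input types; callable once per row."""
    name: str
    result_type: str
    impl: Callable[[Tuple], object]

    def produce_result(self, row: Tuple) -> object:
        # `row` is a plain tuple standing in for an engine-internal row.
        return self.impl(row)


class UnboundStrLen:
    """Loaded by name only; argument types are not checked until bind()."""
    name = "strlen"

    def bind(self, input_types: Sequence[str]) -> BoundScalar:
        # Type checking happens here, so the error can name the exact mismatch.
        if list(input_types) != ["string"]:
            raise BindError(
                f"{self.name} expects (string), got ({', '.join(input_types)})"
            )
        return BoundScalar(self.name, "int", lambda row: len(row[0]))


class ToyFunctionCatalog:
    """Read-only: functions can be listed and loaded, not created or modified."""

    def __init__(self) -> None:
        self._functions = {"strlen": UnboundStrLen()}

    def list_functions(self) -> List[str]:
        return sorted(self._functions)

    def load_function(self, name: str) -> UnboundStrLen:
        if name not in self._functions:
            raise KeyError(f"no such function: {name}")
        return self._functions[name]


catalog = ToyFunctionCatalog()
unbound = catalog.load_function("strlen")   # step 1: load by name
bound = unbound.bind(["string"])            # step 2: bind to input types
print(bound.produce_result(("spark",)))     # prints 5
```

Separating the two steps means a lookup failure ("no such function") and a type mismatch ("expects (string), got (int)") surface as distinct errors, which is one plausible reading of the "better error messages" point in the summary.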
