spark-dev mailing list archives

From John Zhuge <jzh...@apache.org>
Subject Re: [DISCUSS] SPIP: FunctionCatalog
Date Thu, 04 Mar 2021 04:33:44 GMT
+1 Good plan to move forward.

Thank you all for the constructive and comprehensive discussions in this
thread! Decisions on this important feature will have ramifications for
years to come.

On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan <cloud0fan@gmail.com> wrote:

> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
> variants and prefer the "magical methods", then we can have a single
> ScalarFunction interface which has the row-parameter API (with a default
> implementation to fail) and documents to describe the "magical methods"
> (which can be done later).
>
> I'll start the PR review this week to check the naming, doc, etc.
>
> Thanks all for the discussion here and let's move forward!
>
> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue <rblue@netflix.com> wrote:
>
>> Good point, Dongjoon. I think we can probably come to some compromise
>> here:
>>
>>    - Remove SupportsInvoke since it isn’t really needed. We should
>>    always try to find the right method to invoke in the codegen path.
>>    - Add a default implementation of produceResult so that
>>    implementations don’t have to use it. If they don’t implement it and a
>>    magic function can’t be found, then it will throw
>>    UnsupportedOperationException
>>
>> This is assuming that we can agree not to introduce all of the
>> ScalarFunction interface variations, which would have limited utility
>> because of type erasure.
>>
>> Does that sound like a good plan to everyone? If so, I’ll update the SPIP
>> doc so we can move forward.
>>
>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> We shared many opinions from different perspectives.
>>> However, we didn't reach a consensus even on a partial merge that
>>> excludes something (proposed on the PR by me, and on this mailing
>>> thread by Wenchen).
>>>
>>> For the following claims, we have another alternative to mitigate it.
>>>
>>>     > I don't like it because it promotes the row-parameter API and
>>> forces users to implement it, even if the users want to only use the
>>> individual-parameters API.
>>>
>>> Why don't we merge the AS-IS PR by adding something instead of excluding
>>> something?
>>>
>>>     - R produceResult(InternalRow input);
>>>     + default R produceResult(InternalRow input) throws Exception {
>>>     +   throw new UnsupportedOperationException();
>>>     + }
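The diff above can be fleshed out as a minimal sketch. Note this is a hypothetical stand-in, not Spark's actual API: `Object[]` substitutes for `InternalRow`, and the `StrLen` example and its `invoke` magic method are illustrative names only.

```java
// Minimal stand-in for the proposed ScalarFunction interface (not
// Spark's actual API): produceResult has a default implementation that
// throws, so a UDF is not technically forced to override it and can
// rely on a "magic method" instead.
interface ScalarFunction<R> {
    default R produceResult(Object[] input) throws Exception {
        throw new UnsupportedOperationException(
            "Implement produceResult or provide a magic method");
    }
}

class StrLen implements ScalarFunction<Integer> {
    // Hypothetical "magic method": the engine would locate this by
    // signature and invoke it directly, avoiding boxing.
    public int invoke(String s) {
        return s.length();
    }
}

public class DefaultImplDemo {
    public static void main(String[] args) {
        StrLen f = new StrLen();
        System.out.println(f.invoke("spark")); // prints 5
        try {
            f.produceResult(new Object[] {"spark"}); // no override: throws
        } catch (Exception e) {
            System.out.println("row-based fallback not implemented");
        }
    }
}
```

With this shape, an implementation that provides only the magic method still compiles, and the row-based path fails loudly only if Spark actually falls back to it.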
>>>
>>> By providing the default implementation, it will not technically
>>> *force users to implement it*.
>>> And we can properly document the expected usage.
>>> What do you think?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rblue@netflix.com> wrote:
>>>
>>>> Yes, GenericInternalRow is safe when types mismatch, with the cost
>>>> of using Object[], and primitive types need to do boxing
>>>>
>>>> The question is not whether to use the magic functions, which would not
>>>> need boxing. The question here is whether to use multiple
>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>> using Object[] so there isn’t a benefit.
>>>>
>>>> If we do want to reuse one UDF for different types, using “magical
>>>> methods” solves the problem
>>>>
>>>> Yes, that’s correct. We agree that magic methods are a good option for
>>>> this.
>>>>
>>>> Again, the question we need to decide is whether to use InternalRow or
>>>> interfaces like ScalarFunction2 for non-codegen. The option to use
>>>> multiple interfaces is limited by type erasure because you can only have
>>>> one set of type parameters. If you wanted to support both ScalarFunction2<Integer,
>>>> Integer> and ScalarFunction2<Long, Long> you’d have to fall back to
>>>> ScalarFunction2<Object, Object> and cast.
>>>>
>>>> The point is that type erasure will commonly lead either to many
>>>> different implementation classes (one for each type combination) or will
>>>> lead to parameterizing by Object, which defeats the purpose.
>>>>
>>>> The alternative adds safety because correct types are produced by calls
>>>> like getLong(0). Yes, this depends on the implementation making the
>>>> correct calls, but it is better than using Object and casting.
>>>>
>>>> I can’t think of real use cases that will force the
>>>> individual-parameters approach to use Object instead of concrete types.
>>>>
>>>> I think this is addressed by the type erasure discussion above. A
>>>> simple Plus method would require Object or 12 implementations for 2
>>>> arguments and 4 numeric types.
>>>>
>>>> And basically all varargs cases would need to use Object[]. Consider a
>>>> UDF to create a map that requires string keys and some consistent type for
>>>> values. This would be easy with the InternalRow API because you can
>>>> use getString(pos) and get(pos + 1, valueType) to get the key/value
>>>> pairs. Use of UTF8String vs String will be checked at compile time.
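The map-building UDF described above can be sketched as follows. The `Row` class here is a tiny hypothetical stand-in for Spark's `InternalRow` (real `InternalRow` access also takes a `DataType` argument in `get`), and `MakeMap` is an illustrative name; the point is the positional `getString(pos)` / `get(pos + 1)` access pattern for alternating key/value arguments.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for InternalRow, just enough for the sketch.
class Row {
    private final Object[] values;
    Row(Object... values) { this.values = values; }
    String getString(int pos) { return (String) values[pos]; }
    Object get(int pos) { return values[pos]; }
    int numFields() { return values.length; }
}

class MakeMap {
    // Varargs map UDF: alternating key/value arguments. A non-string key
    // fails with a ClassCastException rather than returning garbage.
    Map<String, Object> produceResult(Row input) {
        Map<String, Object> result = new HashMap<>();
        for (int pos = 0; pos + 1 < input.numFields(); pos += 2) {
            result.put(input.getString(pos), input.get(pos + 1));
        }
        return result;
    }
}

public class MakeMapDemo {
    public static void main(String[] args) {
        Map<String, Object> m =
            new MakeMap().produceResult(new Row("a", 1, "b", 2));
        System.out.println(m);
    }
}
```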
>>>>
>>>> I agree that Object[] is worse than InternalRow
>>>>
>>>> Yes, and if we are always using Object because of type erasure or
>>>> using magic methods to get better performance, the utility of the
>>>> parameterized interfaces is very limited.
>>>>
>>>> Because we want to expose the magic functions, the use of
>>>> ScalarFunction2 and similar is extremely limited because it is only
>>>> for non-codegen. Varargs is by far the more common case. The
>>>> InternalRow interface is also a very simple way to get started and
>>>> ensures that Spark can always find the right method after the function is
>>>> bound to input types.
>>>>
>>>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>
>>>>> Yes, GenericInternalRow is safe when types mismatch, with the cost
>>>>> of using Object[], and primitive types need to do boxing. And this is
>>>>> a runtime failure, which is absolutely worse than query-compile-time
>>>>> checks. Also, don't forget my previous point: users need to specify
>>>>> the type and index, such as row.getLong(0), which is error-prone.
>>>>>
>>>>> > But we don’t do that for any of the similar UDFs today so I’m
>>>>> skeptical that this would actually be a high enough priority to implement.
>>>>>
>>>>> I'd say this is a must-have if we go with the individual-parameters
>>>>> approach. The Scala UDF today checks the method signature at compile-time,
>>>>> thanks to the Scala type tag. The Java UDF today doesn't check and
>>>>> is hard to use.
>>>>>
>>>>> > You can’t implement ScalarFunction2<Integer, Integer> and
>>>>> ScalarFunction2<Long, Long>.
>>>>>
>>>>> Can you elaborate? We have function binding and we can use
>>>>> different UDFs for different input types. If we do want to reuse
>>>>> one UDF for different types, using "magical methods" solves the
>>>>> problem:
>>>>> class MyUDF {
>>>>>   def call(i: Int): Int = ...
>>>>>   def call(l: Long): Long = ...
>>>>> }
>>>>>
>>>>> On the other side, I don't think the row-parameter approach can solve
>>>>> this problem. The best I can think of is:
>>>>> class MyUDF extends ScalarFunction[Object] {
>>>>>   def call(row: InternalRow): Object = {
>>>>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>>>   }
>>>>> }
>>>>>
>>>>> This is worse because: 1) it needs to do if-else to check different
>>>>> input types. 2) the return type can only be Object and cause boxing issues.
>>>>>
>>>>> I agree that Object[] is worse than InternalRow. But I can't think of
>>>>> real use cases that will force the individual-parameters approach
>>>>> to use Object instead of concrete types.
>>>>>
>>>>>
>>>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rblue@netflix.com> wrote:
>>>>>
>>>>>> Thanks for adding your perspective, Erik!
>>>>>>
>>>>>> If the input is string type but the UDF implementation calls
>>>>>> row.getLong(0), it returns wrong data
>>>>>>
>>>>>> I think this is misleading. It is true for UnsafeRow, but there is
>>>>>> no reason why InternalRow should return incorrect values.
>>>>>>
>>>>>> The implementation in GenericInternalRow
>>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>>>>>> would throw a ClassCastException. I don’t think that using a row is
>>>>>> a bad option simply because UnsafeRow is unsafe.
>>>>>>
>>>>>> It’s unlikely that UnsafeRow would be used to pass the data. The
>>>>>> implementation would evaluate each argument expression and set the
>>>>>> result in a generic row, then pass that row to the UDF. We can use whatever
>>>>>> implementation we choose to provide better guarantees than unsafe.
>>>>>>
>>>>>> I think we should consider query-compile-time checks as
>>>>>> nearly-as-good as Java-compile-time checks for the purposes of safety.
>>>>>>
>>>>>> I don’t think I agree with this. A failure at query analysis time vs.
>>>>>> runtime still requires going back to a separate project, fixing
>>>>>> something, and rebuilding. The time needed to fix a problem goes up
>>>>>> significantly vs. compile-time checks. And that is even worse if the
>>>>>> UDF is maintained by someone else.
>>>>>>
>>>>>> I think we also need to consider how common it would be that a use
>>>>>> case can have the query-compile-time checks. Going through this in
>>>>>> more detail below makes me think that it is unlikely that these
>>>>>> checks would be used often because of the limitations of using an
>>>>>> interface with type erasure.
>>>>>>
>>>>>> I believe that Wenchen’s proposal will provide stronger
>>>>>> query-compile-time safety
>>>>>>
>>>>>> The proposal could have better safety for each argument, assuming
>>>>>> that we detect failures by looking at the parameter types using reflection
>>>>>> in the analyzer. But we don’t do that for any of the similar UDFs
>>>>>> today, so I’m skeptical that this would actually be a high enough
>>>>>> priority to implement.
>>>>>>
>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>>>>> can’t implement ScalarFunction2<Integer, Integer> and
>>>>>> ScalarFunction2<Long, Long>. You can handle those cases using
>>>>>> InternalRow or you can handle them using VarargScalarFunction<Object>.
>>>>>> That forces many use cases into varargs with Object, where you don’t
>>>>>> get any of the proposed analyzer benefits and lose compile-time checks.
>>>>>> The only time the additional checks (if implemented) would help is
>>>>>> when only one set of argument types is needed, because implementing
>>>>>> ScalarFunction<Object, Object> defeats the purpose.
>>>>>>
>>>>>> It’s worth noting that safety for the magic methods would be identical
>>>>>> between the two options, so the trade-off to consider is for varargs
>>>>>> and non-codegen cases. Combining the limitations discussed, this has
>>>>>> better safety guarantees only if you need just one set of types for
>>>>>> each number of arguments and are using the non-codegen path. Since
>>>>>> varargs is one of the primary reasons to use this API, I don’t think
>>>>>> it is a good idea to use Object[] instead of InternalRow.
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
John Zhuge
