spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Erlandson <eerla...@redhat.com>
Subject Re: UDAFs have an inefficiency problem
Date Wed, 27 Mar 2019 23:40:43 GMT
BTW, if this is known, is there an existing JIRA I should link to?

On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson <eerlands@redhat.com> wrote:

>
> At a high level, some candidate strategies are:
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
> 3. As a workaround, allow users to define their own sub-classes of
> DataType.  It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op.  I tried doing
> this and it will compile, but spark's internals only consider a predefined
> universe of DataType classes.
>
> All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor more substantial.
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <rxin@databricks.com> wrote:
>
>> Yes this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>>
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerlands@redhat.com>
>> wrote:
>>
>>> I describe some of the details here:
>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>
>>> The short version of the story is that aggregating data structures
>>> (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for
>>> every row in a data frame.
>>> Cheers,
>>> Erik
>>>
>>>

Mime
View raw message