spark-dev mailing list archives

From Erik Erlandson <>
Subject Re: UDAFs have an inefficiency problem
Date Wed, 27 Mar 2019 23:40:43 GMT
BTW, if this is known, is there an existing JIRA I should link to?

On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson <> wrote:

> At a high level, some candidate strategies are:
> 1. "Fix" the logic in ScalaUDAF (possibly in conjunction with mods to the
> UDAF trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
> 3. As a workaround, allow users to define their own sub-classes of
> DataType. This would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself, making ser/de a no-op. I tried this;
> it compiles, but Spark's internals only consider a predefined universe of
> DataType classes.
> All of these options are likely to have implications for the Catalyst
> systems. I'm not sure whether they would be minor or more substantial.
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <> wrote:
>> Yes this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <>
>> wrote:
>>> I describe some of the details here:
>>> The short version of the story is that aggregating data structures
>>> (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for
>>> every row in a data frame.
>>> Cheers,
>>> Erik
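The per-row ser/de overhead described above can be sketched without Spark itself (all names below are hypothetical stand-ins, not Spark API): the naive path keeps the buffer in serialized Row-like form, so every input row pays a deserialize plus a serialize, while a typed path mutates a live object and never round-trips per row.

```scala
// Hypothetical sketch of the overhead pattern; these names are NOT Spark API.
object SerDeSketch {
  // Stand-in for a UDT's aggregating structure.
  final case class Agg(count: Long, sum: Double)

  // Stand-ins for UDT serialization to/from a Row-like form.
  def serialize(a: Agg): Array[Any]   = Array(a.count, a.sum)
  def deserialize(r: Array[Any]): Agg =
    Agg(r(0).asInstanceOf[Long], r(1).asInstanceOf[Double])

  // Naive path (analogous to ScalaUDAF): the buffer lives in serialized
  // form, so every update pays deserialize + serialize.
  def naiveAggregate(xs: Seq[Double]): Agg = {
    var row: Array[Any] = serialize(Agg(0L, 0.0))
    for (x <- xs) {
      val a = deserialize(row)                     // per-row deserialization
      row = serialize(Agg(a.count + 1, a.sum + x)) // per-row serialization
    }
    deserialize(row)
  }

  // Typed path (analogous to TypedImperativeAggregate): mutate a live
  // object; ser/de would happen only at partition boundaries.
  def typedAggregate(xs: Seq[Double]): Agg = {
    var a = Agg(0L, 0.0)
    for (x <- xs) a = Agg(a.count + 1, a.sum + x)  // no ser/de per row
    a
  }
}
```

Both paths compute the same answer; the difference is that the naive path does O(rows) serialization round-trips instead of O(partitions).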
