spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Skipping Type Conversion and using InternalRows for UDF
Date Fri, 15 Apr 2016 18:47:40 GMT
This would also probably improve performance:
https://github.com/apache/spark/pull/9565

On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari <hamelkothari@gmail.com>
wrote:

> Hi all,
>
> We have UDFs that take <1ms to execute, yet in practice we're seeing over
> 10ms of overhead per call from the surrounding projections (the data is
> deeply nested with ArrayTypes and MapTypes, which may be the cause).
> Looking at the logs and the ScalaUDF code, I noticed that a series of
> projections runs before and after the UDF to make the rows safe and then
> unsafe again. Is there any way to opt out of this and accept/return
> InternalRows directly, skipping the cost of the type conversion? It
> doesn't immediately appear to be possible, but I'd like to make sure I'm
> not missing anything.
>
> I suspect we could make this possible by checking, in the register
> function, whether the TypeTags are all internal types; if they are, we
> could pass a false value for "needs[Input|Output]Conversion" to ScalaUDF,
> and have ScalaUDF check that flag to decide whether the conversion needs
> to take place. That still leaves the problem of the missing schema when
> outputting InternalRows, but we could expose the DataType parameter
> rather than inferring it in the register function. Is there anything else
> in the code that would prevent this from working?
>
> Regards,
> Hamel
>
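For readers outside the thread, the overhead Hamel describes can be modeled with a small sketch. This is not Spark's actual ScalaUDF code; the function names (`to_internal`, `to_external`, `wrap_udf`) and the `needs_conversion` flag are invented for illustration, standing in for Catalyst's external/internal row converters and the proposed "needs[Input|Output]Conversion" flag:

```python
# Illustrative sketch only -- not Spark's implementation. Models a ScalaUDF-style
# wrapper that pays a conversion cost on every call, plus a hypothetical flag
# that skips conversion when the caller already supplies internal types.

def to_internal(value):
    # Stand-in for an external -> internal conversion (e.g. Row -> InternalRow).
    # Here we just map list -> tuple to make the two representations visible.
    return tuple(value) if isinstance(value, list) else value

def to_external(value):
    # Stand-in for the reverse, internal -> external conversion.
    return list(value) if isinstance(value, tuple) else value

def wrap_udf(func, needs_conversion=True):
    """Return a callable mimicking the eval path around a user function."""
    if not needs_conversion:
        # Hypothetical fast path: trust the caller to pass internal types
        # and to consume internal types from the result.
        return func

    def converted(*args):
        # Convert every argument in, run the UDF, convert the result back out.
        internal_args = [to_internal(a) for a in args]
        return to_external(func(*internal_args))

    return converted

# A UDF that operates on the "internal" (tuple) representation.
def double_first(row):
    return (row[0] * 2,) + row[1:]

safe = wrap_udf(double_first)                           # converts on every call
fast = wrap_udf(double_first, needs_conversion=False)   # skips conversion
```

Calling `safe([3, 4])` pays both conversions and returns `[6, 4]`, while `fast((3, 4))` bypasses them and returns `(6, 4)` directly, which is the shape of the opt-out being proposed.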
