spark-dev mailing list archives

From "Fitch, Simeon" <fi...@astraea.io>
Subject Re: Public API access to UDTs
Date Fri, 29 Jan 2021 15:42:02 GMT
On Fri, Jan 29, 2021 at 9:46 AM Sean Owen <srowen@gmail.com> wrote:

> Are there implications for storing UDTs in particular engines or formats?
>

I've found that UDTs round-trip through Parquet without problems.

They also work fine in PySpark, given an implementation of mirror classes.
Without properly constructed mirror classes they show up as structs, which
isn't a bad fallback.

However, they do *not* work with Spark's use of Arrow, as they get rejected
here:
https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L75-L76




> Just making it public for developers, even with a 'use at your own risk'
> warning, seems pretty small as a change?
>
> On Thu, Jan 28, 2021 at 5:10 PM Fitch, Simeon <fitch@astraea.io> wrote:
>
>> Hi,
>>
>> First time posting here, so apologies if I need to be directing this
>> topic elsewhere.
>>
>> I'm the author of RasterFrames, and a contributor to GeoMesa's Spark SQL
>> module. Both make use of fairly low-level Catalyst constructs, including
>> custom UDTs; RasterFrames introduces a geospatial raster type, and GeoMesa
>> a geometry type.
>>
>> In order to make this work we've circumvented the [`package private`](
>> https://bit.ly/3pr0fVv)  restriction on `UDTRegistration` by inserting
>> sibling classes into the package namespace. It's a hack, and works fine
>> with JVM 8, but violates the [much more restrictive](
>> https://bit.ly/3aadO5g) module constructs in JVM 9+.
>>
>> We've been monitoring [SPARK-7768](
>> https://issues.apache.org/jira/browse/SPARK-7768) (filed in 2015) and
>> its [associated PR](https://github.com/apache/spark/pull/16478) for
>> years now, but it keeps getting kicked down the road(map).
>>
>> As authors of open source systems we completely understand how and why
>> this happens, but we are at a critical juncture in our projects' lifecycle,
>> anchored to JVM 8 while other systems have moved on to later versions. We'd
>> also like to enjoy the benefits of later JVMs.
>>
>> So... I'm here to find out how I and others critically needing public
>> access to `UDTRegistration` might better advocate for it?
>>
>> I think (but am not 100% sure) the PR linked above is more extensive than
>> what we need, also addressing usability around Encoders, for which we have
>> our own type class solution. My assumption to date has been all we need is
>> line 32 of `UDTRegistration` deleted (if there's folly therein, please say
>> so!). While I understand a reluctance to promote `UDTRegistration` to
>> `public`, I note that it has not been changed since 2016, perhaps a good
>> indicator that the API is stable enough. Marking it as `@Experimental`
>> could be a compromise option.
>>
>> Thanks for reading this far and giving this consideration. Any and all
>> advice is appreciated.
>>
>> Simeon (@metasim)
>>
>>
>> --
>> Simeon Fitch
>> Co-founder & VP of R&D
>> Astraea, Inc.
>>
>>

-- 
Simeon Fitch
Co-founder & VP of R&D
Astraea, Inc.
