spark-dev mailing list archives

From Ryan Blue <>
Subject Re: DataSourceV2 capability API
Date Fri, 09 Nov 2018 21:34:26 GMT
Another solution to the decimal case is using the capability API: use a
capability to signal that the table knows about `supports-decimal`. So
before the decimal support check, Spark would first check whether the table
knows about that capability at all; if it doesn't, fall back to the old behavior.
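A rough sketch of that two-step check, in terms of the isSupported call proposed
below (the "knows-supports-decimal" capability name is hypothetical; catalog and
identifier are assumed from the proposal):

val table = catalog.load(identifier)

val decimalOk =
  if (table.isSupported("knows-supports-decimal")) {
    // Source was built after the capability was added: trust its answer.
    table.isSupported("supports-decimal")
  } else {
    // Older source that predates the capability: keep the old, assumed behavior.
    true
  }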

On Fri, Nov 9, 2018 at 12:45 PM Ryan Blue <> wrote:

> For that case, I think we would have a property that defines whether
> supports-decimal is assumed or checked with the capability.
> Wouldn't we have this problem no matter what the capability API is? If we
> used a trait to signal decimal support, then we would have to deal with
> sources that were written before the trait was introduced. That doesn't
> change the need for some way to signal support for specific capabilities
> like the ones I've suggested.
> On Fri, Nov 9, 2018 at 12:38 PM Reynold Xin <> wrote:
>> "If there is no way to report a feature (e.g., able to read missing as
>> null) then there is no way for Spark to take advantage of it in the first
>> place"
>> Consider this (just a hypothetical scenario): we add "supports-decimal"
>> in the future, because we see that a lot of data sources don't support decimal
>> and we want more graceful error handling. That'd break all existing data
>> sources.
>> You can say we would never add any "existing" features to the feature
>> list in the future, as a requirement for the feature list. But then I'm
>> wondering how much it really gives you, beyond telling data sources to
>> throw exceptions when they don't support a specific operation.
>> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue <> wrote:
>>> Do you have an example in mind where we might add a capability and break
>>> old versions of data sources?
>>> These are really for being able to tell what features a data source has.
>>> If there is no way to report a feature (e.g., able to read missing as null)
>>> then there is no way for Spark to take advantage of it in the first place.
>>> For the uses I've proposed, forward compatibility isn't a concern. When we
>>> add a capability, we add handling for it that old versions wouldn't be able
>>> to use anyway. The advantage is that we don't have to treat all sources the
>>> same.
>>> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin <> wrote:
>>>> How do we deal with forward compatibility? Consider this: Spark adds a new
>>>> "property". An existing data source supports that behavior, but since
>>>> it was not explicitly declared, the new version of Spark would consider that
>>>> data source as not supporting the property, and would throw an
>>>> exception.
>>>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <> wrote:
>>>>> I'd have two places. First, a class that defines properties supported
>>>>> and identified by Spark, like the SQLConf definitions. Second, in
>>>>> documentation for the v2 table API.
>>>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <>
>>>>> wrote:
>>>>>> One question is where will the list of capability strings be defined?
>>>>>> On Thursday, November 8, 2018 at 2:09 PM Ryan Blue <> wrote:
>>>>>> Yes, we currently use traits that have methods. Something like
>>>>>> “supports reading missing columns” doesn’t need to deliver any methods. The
>>>>>> other example is where we don’t have an object to test for a trait (e.g.,
>>>>>> scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
>>>>>> done. That could be expensive, so we can use a capability to fail fast
>>>>>> instead.
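As a rough sketch of the difference (SupportsBatch is the existing trait; the
"batch-scan" capability string and the scan/table variables are illustrative
assumptions):

// Today: the trait can only be tested once pushdown has produced a Scan.
if (!scan.isInstanceOf[SupportsBatch]) {
  throw new AnalysisException("Table does not support batch scans")
}

// With a capability: fail fast on the table, before doing any pushdown work.
if (!table.isSupported("batch-scan")) {
  throw new AnalysisException("Table does not support batch scans")
}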
>>>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <>
>>>>>> wrote:
>>>>>>> This is currently accomplished by having traits that data sources
>>>>>>> can extend, as well as runtime exceptions, right? It's hard to argue one way
>>>>>>> vs another without knowing how things will evolve (e.g. how many
>>>>>>> capabilities there will be).
>>>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <>
>>>>>>> wrote:
>>>>>>>> Hi everyone,
>>>>>>>> I’d like to propose an addition to DataSourceV2 tables: a
>>>>>>>> capability API. This API would allow Spark to query a table to determine
>>>>>>>> whether it supports a capability or not:
>>>>>>>> val table = catalog.load(identifier)
>>>>>>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>>>>>> There are a couple of use cases for this. First, we want to be able
>>>>>>>> to fail fast when a user tries to stream a table that doesn’t support it.
>>>>>>>> The design of our read implementation doesn’t necessarily support this. If
>>>>>>>> we want to share the same “scan” across streaming and batch, then we need
>>>>>>>> to “branch” in the API after that point, but that is at odds with failing
>>>>>>>> fast. We could use capabilities to fail fast and not worry about that
>>>>>>>> concern in the read design.
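A rough sketch of that fail-fast check, reusing the isSupported call from the
example above (isStreamingQuery and the analysis-time context are assumed for
illustration):

// Reject a streaming read before building any scan if the table can't serve it.
val table = catalog.load(identifier)
if (isStreamingQuery && !table.isSupported("continuous-streaming")) {
  throw new AnalysisException(
    s"Table $identifier does not support continuous streaming")
}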
>>>>>>>> I also want to use capabilities to change the behavior of
>>>>>>>> validation rules. The rule that validates appends, for example, doesn’t
>>>>>>>> allow a write that is missing an optional column. That’s because the
>>>>>>>> current v1 sources don’t support reading when columns are missing. But
>>>>>>>> Iceberg does support reading a missing column as nulls, so that users can
>>>>>>>> add a column to a table without breaking a scheduled job that populates the
>>>>>>>> table. To fix this problem, I would use a table capability,
>>>>>>>> read-missing-columns-as-null.
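A sketch of how the append-validation rule could branch on that capability
(table.schema, query.output, and the surrounding rule are assumed names for
illustration):

// Allow writes that omit an optional column only when the table can read
// missing columns as null.
val missing = table.schema.fieldNames.toSet -- query.output.map(_.name)
if (missing.nonEmpty && !table.isSupported("read-missing-columns-as-null")) {
  throw new AnalysisException(
    s"Cannot write to table, missing columns: ${missing.mkString(", ")}")
}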
>>>>>>>> Any comments on this approach?
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
> --
> Ryan Blue
> Software Engineer
> Netflix

Ryan Blue
Software Engineer
Netflix
