spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Melo <andrew.m...@gmail.com>
Subject Re: DSv2 & DataSourceRegister
Date Tue, 07 Apr 2020 22:58:25 GMT
Hi Ryan,

On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rblue@netflix.com> wrote:
>
> Hi Andrew,
>
> With DataSourceV2, I recommend plugging in a catalog instead of using DataSource. As
you've noticed, the way that you plug in data sources isn't very flexible. That's one of the
reasons why we changed the plugin system and made it possible to use named catalogs that load
implementations based on configuration properties.
>
> I think it's fine to consider how to patch the registration trait, but I really don't
recommend continuing to identify table implementations directly by name.

Can you be a bit more concrete with what you mean by plugging a
catalog instead of a DataSource? We have been using
sc.read.format("root").load([list of paths]) which works well. Since
we don't have a database or tables, I don't fully understand what's
different between the two interfaces that would make us prefer one or
another.

That being said, WRT the registration trait, if I'm not misreading
createTable() and friends, the "source" parameter is resolved the same
way as DataFrameReader.format(), so a solution that helps out
registration should help both interfaces.

Thanks again,
Andrew

>
> On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.melo@gmail.com> wrote:
>>
>> Hi all,
>>
>> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>> send an email to the dev list for discussion.
>>
>> As the DSv2 API evolves, some breaking changes are occasionally made
>> to the API. It's possible to split a plugin into a "common" part and
>> multiple version-specific parts and this works OK to have a single
>> artifact for users, as long as they write out the fully qualified
>> classname as the DataFrame format(). The one part that can't be
>> currently worked around is the DataSourceRegister trait. Since classes
>> which implement DataSourceRegister must also implement DataSourceV2
>> (and its mixins), changes to those interfaces cause the ServiceLoader
>> to fail when it attempts to load the "wrong" DataSourceV2 class.
>> (there's also an additional prohibition against multiple
>> implementations having the same ShortName in
>> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource).
>> This means users will need to update their notebooks/code/tutorials if
>> they run @ a different site whose cluster is a different version.
>>
>> To solve this, I proposed in SPARK-31363 a new trait who would
>> function the same as the existing DataSourceRegister trait, but adds
>> an additional method:
>>
>> public Class<? implements DataSourceV2> getImplementation();
>>
>> ...which will allow DSv2 plugins to dynamically choose the appropriate
>> DataSourceV2 class based on the runtime environment. This would let us
>> release a single artifact for different Spark versions and users could
>> use the same artifactID & format regardless of where they were
>> executing their code. If there was no services registered with this
>> new trait, the functionality would remain the same as before.
>>
>> I think this functionality will be useful as DSv2 continues to evolve,
>> please let me know your thoughts.
>>
>> Thanks
>> Andrew
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message