spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: DSv2 & DataSourceRegister
Date Wed, 08 Apr 2020 13:35:28 GMT
It would be good to support your use case, but I'm not sure how to
accomplish that. Can you open a PR so that we can discuss it in detail? How
can `public Class<? extends DataSourceV2> getImplementation();` be
possible in 3.0 as there is no `DataSourceV2`?

On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo <andrew.melo@gmail.com> wrote:

> Hello
>
> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not
>> sure this is possible as the DS V2 API is very different in 3.0, e.g. there
>> is no `DataSourceV2` anymore, and you should implement `TableProvider` (if
>> you don't have database/table).
>>
>
> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a toplevel
> Root_v24 (implements DataSourceV2) and Root_v30 (implements TableProvider).
> I can load this jar in both pyspark 2.4 and 3.0 and it works well -- as
> long as I remove the registration from META-INF and pass in the full class
> name to the DataFrameReader.
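For readers following along, the workaround described above can be sketched from the PySpark side roughly as follows. This is only an illustration under stated assumptions: the helper `root_format_for` is hypothetical, and the bare names `Root_v24` and `Root_v30` are taken from the message as-is, so a real jar would most likely need its fully qualified class names here.

```python
# Hypothetical helper: map a Spark version string to the class name that is
# passed to DataFrameReader.format(). The class names are the ones mentioned
# in the message; replace them with the real fully qualified names.
FORMAT_BY_MAJOR_VERSION = {
    2: "Root_v24",  # implements DataSourceV2 (Spark 2.4)
    3: "Root_v30",  # implements TableProvider (Spark 3.0)
}


def root_format_for(spark_version: str) -> str:
    """Return the data source class name for the given Spark version."""
    major = int(spark_version.split(".")[0])
    try:
        return FORMAT_BY_MAJOR_VERSION[major]
    except KeyError:
        raise ValueError(f"no Root reader for Spark {spark_version}") from None


# Intended use in a notebook (needs a live SparkSession named `spark`):
#   df = spark.read.format(root_format_for(spark.version)).load(paths)
```

This keeps notebooks portable across clusters, but every user has to carry the helper around, which is the inconvenience the short-name registration is meant to remove.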
>
> Thanks
> Andrew
>
>
>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.melo@gmail.com> wrote:
>>
>>> Hi Ryan,
>>>
>>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rblue@netflix.com> wrote:
>>> >
>>> > Hi Andrew,
>>> >
>>> > With DataSourceV2, I recommend plugging in a catalog instead of using
>>> > DataSource. As you've noticed, the way that you plug in data sources isn't
>>> > very flexible. That's one of the reasons why we changed the plugin system
>>> > and made it possible to use named catalogs that load implementations based
>>> > on configuration properties.
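As a rough illustration of catalogs being loaded from configuration properties: in Spark 3.0 a catalog is registered under `spark.sql.catalog.<name>`, and further `spark.sql.catalog.<name>.<key>` entries are handed to the implementation as options. The helper below only builds those keys. The catalog name `root`, the class `org.example.RootCatalog`, and the `basePath` option are placeholders for illustration, not part of any real plugin.

```python
from typing import Dict, Optional


def catalog_conf(name: str, impl_class: str,
                 options: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the Spark conf entries that register a named catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {prefix: impl_class}  # which class implements the catalog
    for key, value in (options or {}).items():
        conf[f"{prefix}.{key}"] = value  # per-catalog options
    return conf


# Placeholder values, assumed for this sketch only.
example = catalog_conf("root", "org.example.RootCatalog",
                       {"basePath": "/data/root"})

# Intended use (needs pyspark):
#   builder = SparkSession.builder
#   for k, v in example.items():
#       builder = builder.config(k, v)
```

With this style of registration the implementation is chosen by the cluster or session configuration rather than by a name embedded in user code.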
>>> >
>>> > I think it's fine to consider how to patch the registration trait, but
>>> > I really don't recommend continuing to identify table implementations
>>> > directly by name.
>>>
>>> Can you be a bit more concrete with what you mean by plugging a
>>> catalog instead of a DataSource? We have been using
>>> sc.read.format("root").load([list of paths]) which works well. Since
>>> we don't have a database or tables, I don't fully understand what's
>>> different between the two interfaces that would make us prefer one or
>>> another.
>>>
>>> That being said, WRT the registration trait, if I'm not misreading
>>> createTable() and friends, the "source" parameter is resolved the same
>>> way as DataFrameReader.format(), so a solution that helps out
>>> registration should help both interfaces.
>>>
>>> Thanks again,
>>> Andrew
>>>
>>> >
>>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.melo@gmail.com> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>>> >> send an email to the dev list for discussion.
>>> >>
>>> >> As the DSv2 API evolves, some breaking changes are occasionally made
>>> >> to the API. It's possible to split a plugin into a "common" part and
>>> >> multiple version-specific parts, and this works well enough to ship a
>>> >> single artifact to users, as long as they write out the fully qualified
>>> >> class name in DataFrameReader.format(). The one part that currently
>>> >> can't be worked around is the DataSourceRegister trait. Since classes
>>> >> which implement DataSourceRegister must also implement DataSourceV2
>>> >> (and its mixins), changes to those interfaces cause the ServiceLoader
>>> >> to fail when it attempts to load the "wrong" DataSourceV2 class.
>>> >> (There's also an additional prohibition against multiple
>>> >> implementations having the same shortName in
>>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
>>> >> This means users will need to update their notebooks/code/tutorials if
>>> >> they run at a different site whose cluster runs a different Spark version.
>>> >>
>>> >> To solve this, I proposed in SPARK-31363 a new trait that would
>>> >> function the same as the existing DataSourceRegister trait, but adds
>>> >> an additional method:
>>> >>
>>> >> public Class<? extends DataSourceV2> getImplementation();
>>> >>
>>> >> ...which will allow DSv2 plugins to dynamically choose the appropriate
>>> >> DataSourceV2 class based on the runtime environment. This would let us
>>> >> release a single artifact for different Spark versions and users could
>>> >> use the same artifactID & format regardless of where they were
>>> >> executing their code. If there were no services registered with this
>>> >> new trait, the functionality would remain the same as before.
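To make the proposal easier to discuss, here is a toy model of the lookup it implies, written in Python rather than against Spark itself. Everything in it is an assumption for illustration: `VersionedRegister`, `lookup_data_source`, and the returned class names are hypothetical, and the real `DataSource.lookupDataSource` works on loaded classes via ServiceLoader rather than on strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VersionedRegister:
    """Toy stand-in for the proposed trait: a short name plus a chooser."""
    short_name: str
    # Given the running Spark version, return the implementation class name.
    get_implementation: Callable[[str], str]


def lookup_data_source(name: str, registers: List[VersionedRegister],
                       spark_version: str) -> str:
    """Resolve a format() name to an implementation class name."""
    matches = [r for r in registers
               if r.short_name.lower() == name.lower()]
    if not matches:
        # No registration: treat the name as a class name, as before.
        return name
    if len(matches) > 1:
        raise ValueError(f"multiple sources registered as {name!r}")
    return matches[0].get_implementation(spark_version)


# One registration serving both API generations (names are placeholders).
root = VersionedRegister(
    "root",
    lambda v: "Root_v24" if v.startswith("2.") else "Root_v30",
)
```

The point of the sketch is that only the small register object has to load on every Spark version; the version-specific class is named lazily and never touched on the "wrong" runtime.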
>>> >>
>>> >> I think this functionality will be useful as DSv2 continues to evolve,
>>> >> please let me know your thoughts.
>>> >>
>>> >> Thanks
>>> >> Andrew
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>>
