spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Official support of CREATE EXTERNAL TABLE
Date Wed, 07 Oct 2020 21:18:51 GMT
I don’t think Spark ever claims to be 100% Hive compatible.

I just found some relevant documentation
<https://docs.databricks.com/spark/latest/spark-sql/compatibility/hive.html#apache-hive-compatibility>
on this, where Databricks claims that “Apache Spark SQL in Databricks is
designed to be compatible with the Apache Hive, including metastore
connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the
Spark community has made the claim. That isn’t a claim of 100%
compatibility, but no one suggested that there was 100% compatibility.

On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue <rblue@netflix.com> wrote:

> I don’t think Spark ever claims to be 100% Hive compatible.
>
> By accepting the EXTERNAL keyword in some circumstances, Spark is
> providing compatibility with Hive DDL. Yes, there are places where it
> breaks. The question is whether we should deliberately break what a Hive
> catalog could implement, when we know what Hive’s behavior is.
>
> CREATE EXTERNAL TABLE is not a Hive-specific feature
>
> Great. So there are other catalogs that could use it. Why should Spark
> choose to limit Hive’s interpretation of this keyword?
>
> While it is great that we seem to agree that Spark shouldn’t do this — now
> that Nessie was pointed out — I’m concerned that you still seem to think
> this is a choice that Spark could reasonably make. *Spark cannot
> arbitrarily choose how to interpret DDL for an external catalog*.
>
> You may not consider this arbitrary because there are other examples where
> location is required. But the Hive community made the choice that these
> clauses are orthogonal, so it is clearly a choice of the external system,
> and it is not Spark’s role to dictate how an external database should
> behave.
>
> I think the Nessie use case is good enough to justify the decoupling of
> EXTERNAL and LOCATION.
>
> It appears that we have consensus. This will be passed to catalogs, which
> can implement the behavior that they choose.
>
> On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> > I have some hive queries that I want to run on Spark.
>>
>> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
>> LOCATION can't help you too much here. If you do have this use case, we
>> need a much wider discussion about how to achieve it.
>>
>> For this particular topic, we need concrete use cases like Nessie
>> <https://projectnessie.org/tools/hive/>. It will be great to see more
>> concrete use cases, but I think the Nessie use case is good enough to
>> justify the decoupling of EXTERNAL and LOCATION.
>>
>> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
>> have it. That's why I think Hive-compatibility alone is not a reasonable
>> use case. For your reference:
>> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
>> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
>> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or
>> FILE_NAME option: doc
>> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
>> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
>> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>>
>> > with which Spark claims to be compatible
>>
>> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
>> diverged from Hive intentionally in several places, where we think the Hive
>> behavior was not reasonable and we shouldn't follow it.
>>
>> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rblue@netflix.com> wrote:
>>
>>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>>> external table.
>>>
>>> I think there is some confusion about what Spark has to handle.
>>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>>> MetaStore that Spark connects to, and the general expectation is that Spark
>>> doesn’t change the meaning of table configuration. There are notable bugs
>>> where Spark has different behavior, but that is the expectation.
>>>
>>> In this particular case, we’re talking about what can be expressed in
>>> DDL that is sent to an external catalog. Spark could (unwisely) choose to
>>> disallow some DDL combinations, but the table is implemented through a
>>> plugin so the interpretation is up to the plugin. Spark has no role in
>>> choosing how to treat this table, unless it is loaded through Spark’s
>>> built-in catalog; in which case, see above.
>>>
>>> I don’t think Hive compatibility itself is a “use case”.
>>>
>>> Why?
>>>
>>> Hive is an external database that defines its own behavior and with
>>> which Spark claims to be compatible. If Hive isn’t a valid use case, then
>>> why is EXTERNAL supported at all?
>>>
>>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>
>>>>> I don't think Hive compatibility itself is a "use case".
>>>>>
>>>> Ok let's add on top of this: I have some hive queries that I want to
>>>> run on Spark. I believe that makes it a use case.
>>>>
>>>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>>>> mentioned is a reasonable use case to me: some frameworks/applications
want
>>>>> to create external tables without user-specified location, so that they
can
>>>>> manage the table directory themselves and implement fancy features.
>>>>>
>>>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION.
>>>>> We should clearly document that, EXTERNAL and LOCATION are only applicable
>>>>> for file-based data sources, and catalog implementation should fail if
the
>>>>> table has EXTERNAL or LOCATION property, but the table provider is not
>>>>> file-based.
>>>>>
>>>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>>>> an external table. Hive gives warning when you create managed tables
with
>>>>> custom location, which means this behavior is not recommended. Shall
we
>>>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>>>
>>>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rblue@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>>>
>>>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>>>> Hive behavior can’t be implemented if Spark chooses to couple this
with
>>>>>> LOCATION. Why is this use case not a justification?
>>>>>>
>>>>>> Also, the option to keep behavior the same as before is not mutually
>>>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>>>> have the same behavior in its catalog. But Spark cannot just choose
to
>>>>>> break compatibility with external systems by deciding when to fail
certain
>>>>>> combinations of DDL options. Choosing not to allow external without
>>>>>> location when it is valid for Hive prevents building a compatible
catalog.
>>>>>>
>>>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>>>> which enables branching and tagging table states across several table
>>>>>> formats and aims to be compatible with Hive.
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > As someone who's had the job of porting different SQL dialects
to
>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>>>
>>>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The
2
>>>>>>> options we are discussing are:
>>>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must
>>>>>>> co-exists with LOCATION (or path option).
>>>>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>>>>
>>>>>>> I'm fine with option 2 if there are reasonable use cases. I think
>>>>>>> it's always safer to keep the behavior the same as before. If
we want to
>>>>>>> change the behavior and follow option 2, we need use cases to
justify it.
>>>>>>>
>>>>>>> For now, the only use case I see is for Hive compatibility and
allow
>>>>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any
more use
>>>>>>> cases we are targeting?
>>>>>>>
>>>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <holden@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> As someone who's had the job of porting different SQL dialects
to
>>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and
I think Ryan's
>>>>>>>> suggestion of leaving it up to the catalogs on how to handle
this makes
>>>>>>>> sense.
>>>>>>>>
>>>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rblue@netflix.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would summarize both the problem and the current state
>>>>>>>>> differently.
>>>>>>>>>
>>>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility
>>>>>>>>> with Hive SQL, but Spark’s built-in catalog doesn’t
allow creating a table
>>>>>>>>> with EXTERNAL unless LOCATION is also present. *This
“hidden
>>>>>>>>> feature” breaks compatibility with Hive SQL* because
all
>>>>>>>>> combinations of EXTERNAL and LOCATION are valid in Hive,
but
>>>>>>>>> creating an external table with a default location is
not allowed by Spark.
>>>>>>>>> Note that Spark must still handle these tables because
it shares a
>>>>>>>>> metastore with Hive, which can still create them.
>>>>>>>>>
>>>>>>>>> Now catalogs can be plugged in, the question is whether
to pass
>>>>>>>>> the fact that EXTERNAL was in the CREATE TABLE statement
to the
>>>>>>>>> v2 catalog handling a create command, or to suppress
it and apply Spark’s
>>>>>>>>> rule that LOCATION must be present.
>>>>>>>>>
>>>>>>>>> If it is not passed to the catalog, then a Hive catalog
cannot
>>>>>>>>> implement the behavior of Hive SQL, even though Spark
added the keyword for
>>>>>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL
>>>>>>>>> however Spark chooses to, but I think it is a poor choice
to force
>>>>>>>>> different behavior on other catalogs.
>>>>>>>>>
>>>>>>>>> Wenchen has also argued that the purpose of this is to
standardize
>>>>>>>>> behavior across catalogs. But hiding EXTERNAL would not
>>>>>>>>> accomplish that goal. Whether to physically delete data
is a choice that is
>>>>>>>>> up to the catalog. Some catalogs have no “external”
concept and will always
>>>>>>>>> drop data when a table is dropped. The ability to keep
underlying data
>>>>>>>>> files is specific to a few catalogs, and whether that
is controlled by
>>>>>>>>> EXTERNAL, the LOCATION clause, or something else is still
up to
>>>>>>>>> the catalog implementation.
>>>>>>>>>
>>>>>>>>> I don’t think that there is a good reason to force
catalogs to
>>>>>>>>> break compatibility with Hive SQL, while making it appear
as though DDL is
>>>>>>>>> compatible. Because removing EXTERNAL would be a breaking
change
>>>>>>>>> to the SQL parser, I think the best option is to pass
it to v2 catalogs so
>>>>>>>>> the catalog can decide how to handle it.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'd like to start a discussion thread about this
topic, as it
>>>>>>>>>> blocks an important feature that we target for Spark
3.1: unify the CREATE
>>>>>>>>>> TABLE SQL syntax.
>>>>>>>>>>
>>>>>>>>>> A bit more background for CREATE EXTERNAL TABLE:
it's kind of a
>>>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>>>
>>>>>>>>>> When you write native CREATE TABLE syntax such as
`CREATE
>>>>>>>>>> EXTERNAL TABLE ... USING parquet`, the parser fails
and tells
>>>>>>>>>> you that EXTERNAL can't be specified.
>>>>>>>>>>
>>>>>>>>>> When we write Hive CREATE TABLE syntax, the EXTERNAL
can be
>>>>>>>>>> specified if LOCATION clause or path option is present.
For example, `CREATE
>>>>>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed
as there is
>>>>>>>>>> no LOCATION clause or path option. This is not 100%
Hive compatible.
>>>>>>>>>>
>>>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one
problem is
>>>>>>>>>> how to deal with CREATE EXTERNAL TABLE. We can keep
it as a hidden feature
>>>>>>>>>> as it was, or we can officially support it.
>>>>>>>>>>
>>>>>>>>>> Please let us know your thoughts:
>>>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL
TABLE to
>>>>>>>>>> do? Have you used it in production before? For what
use cases?
>>>>>>>>>> 2. As a catalog developer, how are you going to implement
>>>>>>>>>> EXTERNAL TABLE? It seems to me that it only makes
sense for file source, as
>>>>>>>>>> the table directory can be managed. I'm not sure
how to interpret EXTERNAL
>>>>>>>>>> in catalogs like jdbc, cassandra, etc.
>>>>>>>>>>
>>>>>>>>>> For more details, please refer to the long discussion
in
>>>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Wenchen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Mime
View raw message