spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Official support of CREATE EXTERNAL TABLE
Date Wed, 07 Oct 2020 18:54:49 GMT
> I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing
compatibility with Hive DDL. Yes, there are places where it breaks. The
question is whether we should deliberately break what a Hive catalog could
implement, when we know what Hive’s behavior is.

> CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark
choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now
that Nessie was pointed out — I’m concerned that you still seem to think
this is a choice that Spark could reasonably make. *Spark cannot
arbitrarily choose how to interpret DDL for an external catalog*.

You may not consider this arbitrary because there are other examples where
location is required. But the Hive community made the choice that these
clauses are orthogonal, so it is clearly a choice of the external system,
and it is not Spark’s role to dictate how an external database should
behave.

> I think the Nessie use case is good enough to justify the decoupling of
> EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which
can implement the behavior that they choose.
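
To make the distinction concrete, here is a minimal sketch of two catalog
policies for the same DDL flags. It is not Spark's catalog API: the
CreateTableRequest class and both policy functions are hypothetical names
used only for illustration. The rules they encode are the ones described in
this thread (the built-in catalog requires LOCATION with EXTERNAL and
treats LOCATION alone as external; Hive treats the two as orthogonal).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CreateTableRequest:
    """Hypothetical stand-in for what Spark would hand to a catalog plugin."""
    name: str
    external: bool = False          # EXTERNAL keyword was present in the DDL
    location: Optional[str] = None  # LOCATION clause, if any


def spark_builtin_policy(req: CreateTableRequest) -> str:
    """Built-in rule as described in this thread: EXTERNAL needs LOCATION."""
    if req.external and req.location is None:
        raise ValueError("EXTERNAL requires LOCATION in this catalog")
    # LOCATION without EXTERNAL is currently treated as an external table.
    return "external" if req.location is not None else "managed"


def hive_like_policy(req: CreateTableRequest) -> str:
    """Hive-style rule: EXTERNAL and LOCATION are orthogonal."""
    return "external" if req.external else "managed"
```

With EXTERNAL passed through, each catalog picks its own function; if Spark
suppressed the flag, only the first policy could ever be implemented.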

On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <cloud0fan@gmail.com> wrote:

> > I have some hive queries that I want to run on Spark.
>
> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
> LOCATION won't help you much here. If you do have this use case, we
> need a much wider discussion about how to achieve it.
>
> For this particular topic, we need concrete use cases like Nessie
> <https://projectnessie.org/tools/hive/>. It would be great to see more
> concrete use cases, but I think the Nessie use case is good enough to
> justify the decoupling of EXTERNAL and LOCATION.
>
> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
> have it. That's why I think Hive-compatibility alone is not a reasonable
> use case. For your reference:
> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
> option: doc
> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>
> > with which Spark claims to be compatible
>
> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
> diverged from Hive intentionally in several places, where we think the Hive
> behavior was not reasonable and we shouldn't follow it.
>
> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rblue@netflix.com> wrote:
>
>> > how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> > external table.
>>
>> I think there is some confusion about what Spark has to handle.
>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>> MetaStore that Spark connects to, and the general expectation is that Spark
>> doesn’t change the meaning of table configuration. There are notable bugs
>> where Spark has different behavior, but that is the expectation.
>>
>> In this particular case, we’re talking about what can be expressed in DDL
>> that is sent to an external catalog. Spark could (unwisely) choose to
>> disallow some DDL combinations, but the table is implemented through a
>> plugin so the interpretation is up to the plugin. Spark has no role in
>> choosing how to treat this table, unless it is loaded through Spark’s
>> built-in catalog; in which case, see above.
>>
>> > I don’t think Hive compatibility itself is a “use case”.
>>
>> Why?
>>
>> Hive is an external database that defines its own behavior and with which
>> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
>> EXTERNAL supported at all?
>>
>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <holden@pigscanfly.ca>
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>
>>>> I don't think Hive compatibility itself is a "use case".
>>>>
>>> Ok let's add on top of this: I have some hive queries that I want to run
>>> on Spark. I believe that makes it a use case.
>>>
>>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>>> to create external tables without user-specified location, so that they can
>>>> manage the table directory themselves and implement fancy features.
>>>>
>>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION.
>>>> We should clearly document that EXTERNAL and LOCATION are only applicable
>>>> to file-based data sources, and that a catalog implementation should fail
>>>> if the table has the EXTERNAL or LOCATION property but the table provider
>>>> is not file-based.
>>>>
>>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>>> an external table. Hive gives a warning when you create managed tables with
>>>> custom location, which means this behavior is not recommended. Shall we
>>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>>
>>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rblue@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>>
>>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>>> Hive behavior can’t be implemented if Spark chooses to couple this
>>>>> with LOCATION. Why is this use case not a justification?
>>>>>
>>>>> Also, the option to keep behavior the same as before is not mutually
>>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>>> have the same behavior in its catalog. But Spark cannot just choose to
>>>>> break compatibility with external systems by deciding when to fail certain
>>>>> combinations of DDL options. Choosing not to allow external without
>>>>> location when it is valid for Hive prevents building a compatible catalog.
>>>>>
>>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>>> which enables branching and tagging table states across several table
>>>>> formats and aims to be compatible with Hive.
>>>>>
>>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > As someone who's had the job of porting different SQL dialects to
>>>>>> > Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>>
>>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2
>>>>>> options we are discussing are:
>>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>>>>>> with LOCATION (or path option).
>>>>>> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>>>>>>
>>>>>> I'm fine with option 2 if there are reasonable use cases. I think
>>>>>> it's always safer to keep the behavior the same as before. If we want
>>>>>> to change the behavior and follow option 2, we need use cases to
>>>>>> justify it.
>>>>>>
>>>>>> For now, the only use case I see is Hive compatibility: allowing
>>>>>> EXTERNAL TABLE without a user-specified LOCATION. Are there any more
>>>>>> use cases we are targeting?
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <holden@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> As someone who's had the job of porting different SQL dialects to
>>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think
>>>>>>> Ryan's suggestion of leaving it up to the catalogs on how to handle
>>>>>>> this makes sense.
>>>>>>>
>>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rblue@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would summarize both the problem and the current state
>>>>>>>> differently.
>>>>>>>>
>>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility
>>>>>>>> with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a
>>>>>>>> table with EXTERNAL unless LOCATION is also present. *This “hidden
>>>>>>>> feature” breaks compatibility with Hive SQL* because all
>>>>>>>> combinations of EXTERNAL and LOCATION are valid in Hive, but
>>>>>>>> creating an external table with a default location is not allowed by
>>>>>>>> Spark. Note that Spark must still handle these tables because it
>>>>>>>> shares a metastore with Hive, which can still create them.
>>>>>>>>
>>>>>>>> Now that catalogs can be plugged in, the question is whether to pass
>>>>>>>> the fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>>>>> catalog handling a create command, or to suppress it and apply
>>>>>>>> Spark’s rule that LOCATION must be present.
>>>>>>>>
>>>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>>>> implement the behavior of Hive SQL, even though Spark added the
>>>>>>>> keyword for Hive compatibility. The Spark catalog can interpret
>>>>>>>> EXTERNAL however Spark chooses to, but I think it is a poor choice
>>>>>>>> to force different behavior on other catalogs.
>>>>>>>>
>>>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>>>>> that goal. Whether to physically delete data is a choice that is up
>>>>>>>> to the catalog. Some catalogs have no “external” concept and will
>>>>>>>> always drop data when a table is dropped. The ability to keep
>>>>>>>> underlying data files is specific to a few catalogs, and whether
>>>>>>>> that is controlled by EXTERNAL, the LOCATION clause, or something
>>>>>>>> else is still up to the catalog implementation.
>>>>>>>>
>>>>>>>> I don’t think that there is a good reason to force catalogs to
>>>>>>>> break compatibility with Hive SQL, while making it appear as though
>>>>>>>> DDL is compatible. Because removing EXTERNAL would be a breaking
>>>>>>>> change to the SQL parser, I think the best option is to pass it to
>>>>>>>> v2 catalogs so the catalog can decide how to handle it.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>>>>> blocks an important feature that we target for Spark 3.1: unify the
>>>>>>>>> CREATE TABLE SQL syntax.
>>>>>>>>>
>>>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>>
>>>>>>>>> When you write native CREATE TABLE syntax such as `CREATE
>>>>>>>>> EXTERNAL TABLE ... USING parquet`, the parser fails and tells you
>>>>>>>>> that EXTERNAL can't be specified.
>>>>>>>>>
>>>>>>>>> When we write Hive CREATE TABLE syntax, EXTERNAL can be specified
>>>>>>>>> only if a LOCATION clause or path option is present. For example,
>>>>>>>>> `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed as
>>>>>>>>> there is no LOCATION clause or path option. This is not 100% Hive
>>>>>>>>> compatible.
>>>>>>>>>
>>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden
>>>>>>>>> feature as it was, or we can officially support it.
>>>>>>>>>
>>>>>>>>> Please let us know your thoughts:
>>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>>>>> Have you used it in production before? For what use cases?
>>>>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>>>>> TABLE? It seems to me that it only makes sense for file sources, as
>>>>>>>>> the table directory can be managed. I'm not sure how to interpret
>>>>>>>>> EXTERNAL in catalogs like JDBC, Cassandra, etc.
>>>>>>>>>
>>>>>>>>> For more details, please refer to the long discussion in
>>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
