spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Official support of CREATE EXTERNAL TABLE
Date Wed, 07 Oct 2020 12:51:17 GMT
> As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
are discussing are:
1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with
LOCATION (or path option).
2. Always allow EXTERNAL, and decouple it from LOCATION.
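
To make the difference between the two options concrete, here is a minimal
sketch in Python. It is not Spark code: the `CreateTable` class and both check
functions are hypothetical names invented for illustration, and the statement
model is deliberately reduced to the two fields under discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateTable:
    """Hypothetical, simplified model of a parsed CREATE TABLE statement."""
    external: bool = False
    location: Optional[str] = None  # LOCATION clause or path option


def check_option1(stmt: CreateTable) -> None:
    """Option 1: EXTERNAL is rejected unless a location is also given."""
    if stmt.external and stmt.location is None:
        raise ValueError("EXTERNAL requires LOCATION (or path option)")


def check_option2(stmt: CreateTable) -> None:
    """Option 2: EXTERNAL is always accepted, independent of LOCATION."""
    return None


# Models CREATE EXTERNAL TABLE with no LOCATION clause or path option.
stmt = CreateTable(external=True)
check_option2(stmt)  # accepted under option 2
try:
    check_option1(stmt)
except ValueError as err:
    print("option 1 rejects:", err)
```

The only statement the two options treat differently is EXTERNAL without a
location; every other combination passes both checks.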

I'm fine with option 2 if there are reasonable use cases. I think it's
always safer to keep the behavior the same as before. If we want to change
the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is Hive compatibility: allowing
EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
cases we are targeting?

On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <holden@pigscanfly.ca> wrote:

> As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
> suggestion of leaving it up to the catalogs on how to handle this makes
> sense.
>
> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
>> I would summarize both the problem and the current state differently.
>>
>> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
>> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>> LOCATION are valid in Hive, but creating an external table with a
>> default location is not allowed by Spark. Note that Spark must still handle
>> these tables because it shares a metastore with Hive, which can still
>> create them.
>>
>> Now that catalogs can be plugged in, the question is whether to pass the fact
>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>> handling a create command, or to suppress it and apply Spark’s rule that
>> LOCATION must be present.
>>
>> If it is not passed to the catalog, then a Hive catalog cannot implement
>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>> chooses to, but I think it is a poor choice to force different behavior on
>> other catalogs.
>>
>> Wenchen has also argued that the purpose of this is to standardize
>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>> goal. Whether to physically delete data is a choice that is up to the
>> catalog. Some catalogs have no “external” concept and will always drop data
>> when a table is dropped. The ability to keep underlying data files is
>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>> the LOCATION clause, or something else is still up to the catalog
>> implementation.
>>
>> I don’t think that there is a good reason to force catalogs to break
>> compatibility with Hive SQL, while making it appear as though DDL is
>> compatible. Because removing EXTERNAL would be a breaking change to the
>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>> catalog can decide how to handle it.
>>
>> rb
>>
>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start a discussion thread about this topic, as it blocks an
>>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>>> syntax.
>>>
>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>> feature in Spark for Hive compatibility.
>>>
>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>> TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
>>> can't be specified.
>>>
>>> When we write Hive CREATE TABLE syntax, EXTERNAL can be specified only if
>>> a LOCATION clause or path option is present. For example, `CREATE
>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>
>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>> was, or we can officially support it.
>>>
>>> Please let us know your thoughts:
>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>>> you used it in production before? For what use cases?
>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>> TABLE? It seems to me that it only makes sense for file source, as the
>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>> catalogs like jdbc, cassandra, etc.
>>>
>>> For more details, please refer to the long discussion in
>>> https://github.com/apache/spark/pull/28026
>>>
>>> Thanks,
>>> Wenchen
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
