spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Official support of CREATE EXTERNAL TABLE
Date Wed, 07 Oct 2020 18:08:09 GMT
> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places, so decoupling EXTERNAL and
LOCATION won't help you much here. If you do have this use case, we
need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie
<https://projectnessie.org/tools/hive/>. It would be great to see more
concrete use cases, but I think the Nessie use case is good enough to
justify the decoupling of EXTERNAL and LOCATION.
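
To make the two options from earlier in the thread concrete, here is a minimal Python sketch. It is not Spark code: `CreateTableRequest`, `check_coupled` and `check_decoupled` are hypothetical names invented for illustration, and the property keys are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CreateTableRequest:
    """Hypothetical, simplified view of a parsed CREATE TABLE statement."""
    external: bool                  # the EXTERNAL keyword was present
    location: Optional[str] = None  # LOCATION clause (or path option), if any


def check_coupled(req: CreateTableRequest) -> None:
    """Option 1: EXTERNAL must co-exist with LOCATION."""
    if req.external and req.location is None:
        raise ValueError("EXTERNAL requires LOCATION (or a path option)")


def check_decoupled(req: CreateTableRequest) -> dict:
    """Option 2: accept every combination and hand both facts to the catalog.

    The returned dict stands in for whatever is passed to the catalog;
    the key names are assumptions made for this sketch.
    """
    props = {"external": req.external}
    if req.location is not None:
        props["location"] = req.location
    return props
```

Under option 1, `CreateTableRequest(external=True)` is rejected before any catalog sees it. Under option 2 the same request reaches the catalog, which is what a Nessie-style catalog that manages the table directory itself would need.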

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
have it. That's why I think Hive-compatibility alone is not a reasonable
use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
clause as Spark does: doc
<https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause
as Spark does: doc
<https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
3. Db2 supports CREATE EXTERNAL TABLE and requires the DATAOBJECT or
FILE_NAME option: doc
<https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
4. SQL Server also supports CREATE EXTERNAL TABLE: doc
<https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>

> with which Spark claims to be compatible

I don't think Spark has ever claimed to be 100% Hive compatible. In fact, we
diverged from Hive intentionally in several places where we thought the Hive
behavior was not reasonable and we shouldn't follow it.

On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rblue@netflix.com> wrote:

> how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table.
>
> I think there is some confusion about what Spark has to handle. Regardless
> of what Spark allows as DDL, these tables can exist in a Hive MetaStore
> that Spark connects to, and the general expectation is that Spark doesn’t
> change the meaning of table configuration. There are notable bugs where
> Spark has different behavior, but that is the expectation.
>
> In this particular case, we’re talking about what can be expressed in DDL
> that is sent to an external catalog. Spark could (unwisely) choose to
> disallow some DDL combinations, but the table is implemented through a
> plugin so the interpretation is up to the plugin. Spark has no role in
> choosing how to treat this table, unless it is loaded through Spark’s
> built-in catalog; in which case, see above.
>
> I don’t think Hive compatibility itself is a “use case”.
>
> Why?
>
> Hive is an external database that defines its own behavior and with which
> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
> EXTERNAL supported at all?
>
> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <holden@pigscanfly.ca> wrote:
>
>>
>>
>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> I don't think Hive compatibility itself is a "use case".
>>>
>> Ok let's add on top of this: I have some hive queries that I want to run
>> on Spark. I believe that makes it a use case.
>>
>>> The Nessie <https://projectnessie.org/tools/hive/> example you
>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>> to create external tables without user-specified location, so that they can
>>> manage the table directory themselves and implement fancy features.
>>>
>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>>> should clearly document that EXTERNAL and LOCATION are only applicable to
>>> file-based data sources, and that catalog implementations should fail if
>>> the table has the EXTERNAL or LOCATION property but the table provider is
>>> not file-based.
>>>
>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>> an external table. Hive gives a warning when you create a managed table
>>> with a custom location, which means this behavior is not recommended.
>>> Shall we "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>
>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rblue@netflix.com.invalid>
>>> wrote:
>>>
>>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>>
>>>> The keyword came from Hive and we all agree that a Hive catalog with
>>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>>> LOCATION. Why is this use case not a justification?
>>>>
>>>> Also, the option to keep behavior the same as before is not mutually
>>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to
>>>> have the same behavior in its catalog. But Spark cannot just choose to
>>>> break compatibility with external systems by deciding when to fail certain
>>>> combinations of DDL options. Choosing not to allow external without
>>>> location when it is valid for Hive prevents building a compatible catalog.
>>>>
>>>> There are many reasons to build a Hive-compatible catalog. A great
>>>> recent example is Nessie <https://projectnessie.org/tools/hive/>,
>>>> which enables branching and tagging table states across several table
>>>> formats and aims to be compatible with Hive.
>>>>
>>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>>
>>>>> > As someone who's had the job of porting different SQL dialects to
>>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>>
>>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2
>>>>> options we are discussing are:
>>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>>>>> with LOCATION (or path option).
>>>>> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>>>>>
>>>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>>>> always safer to keep the behavior the same as before. If we want to change
>>>>> the behavior and follow option 2, we need use cases to justify it.
>>>>>
>>>>> For now, the only use case I see is Hive compatibility, i.e. allowing
>>>>> EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
>>>>> cases we are targeting?
>>>>>
>>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <holden@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> As someone who's had the job of porting different SQL dialects to
>>>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think
>>>>>> Ryan's suggestion of leaving it up to the catalogs on how to handle
>>>>>> this makes sense.
>>>>>>
>>>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rblue@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> I would summarize both the problem and the current state differently.
>>>>>>>
>>>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table
>>>>>>> with EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>>>>> breaks compatibility with Hive SQL* because all combinations of
>>>>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>>>>> table with a default location is not allowed by Spark. Note that Spark
>>>>>>> must still handle these tables because it shares a metastore with
>>>>>>> Hive, which can still create them.
>>>>>>>
>>>>>>> Now that catalogs can be plugged in, the question is whether to pass
>>>>>>> the fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>>>> catalog handling a create command, or to suppress it and apply Spark’s
>>>>>>> rule that LOCATION must be present.
>>>>>>>
>>>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>>>> implement the behavior of Hive SQL, even though Spark added the
>>>>>>> keyword for Hive compatibility. The Spark catalog can interpret
>>>>>>> EXTERNAL however Spark chooses to, but I think it is a poor choice to
>>>>>>> force different behavior on other catalogs.
>>>>>>>
>>>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>>>> that goal. Whether to physically delete data is a choice that is up
>>>>>>> to the catalog. Some catalogs have no “external” concept and will
>>>>>>> always drop data when a table is dropped. The ability to keep
>>>>>>> underlying data files is specific to a few catalogs, and whether that
>>>>>>> is controlled by EXTERNAL, the LOCATION clause, or something else is
>>>>>>> still up to the catalog implementation.
>>>>>>>
>>>>>>> I don’t think that there is a good reason to force catalogs to break
>>>>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>>>>> the SQL parser, I think the best option is to pass it to v2 catalogs
>>>>>>> so the catalog can decide how to handle it.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0fan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>>>> blocks an important feature that we target for Spark 3.1: unify the
>>>>>>>> CREATE TABLE SQL syntax.
>>>>>>>>
>>>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>>>
>>>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>>>> EXTERNAL can't be specified.
>>>>>>>>
>>>>>>>> When we write Hive CREATE TABLE syntax, EXTERNAL can be specified
>>>>>>>> only if the LOCATION clause or path option is present. For example,
>>>>>>>> `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed as there
>>>>>>>> is no LOCATION clause or path option. This is not 100% Hive
>>>>>>>> compatible.
>>>>>>>>
>>>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden
>>>>>>>> feature as it was, or we can officially support it.
>>>>>>>>
>>>>>>>> Please let us know your thoughts:
>>>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>>>> Have you used it in production before? For what use cases?
>>>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>>>> TABLE? It seems to me that it only makes sense for file source, as
>>>>>>>> the table directory can be managed. I'm not sure how to interpret
>>>>>>>> EXTERNAL in catalogs like jdbc, cassandra, etc.
>>>>>>>>
>>>>>>>> For more details, please refer to the long discussion in
>>>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wenchen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
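
The rule proposed in the quoted thread (a catalog should fail when EXTERNAL or LOCATION is set but the provider is not file-based) and the built-in behavior described there (LOCATION without EXTERNAL is treated as external) could be sketched roughly as follows. This is a Python illustration only: the function names and the provider set are assumptions, not Spark's actual API or its list of file sources.

```python
from typing import Optional

# Assumption: an illustrative set of file-based providers, not Spark's own list.
FILE_BASED_PROVIDERS = {"parquet", "orc", "csv", "json"}


def validate_create_table(provider: str, external: bool,
                          location: Optional[str]) -> None:
    """Fail if EXTERNAL or LOCATION is used with a non-file-based provider."""
    file_based = provider.lower() in FILE_BASED_PROVIDERS
    if not file_based and (external or location is not None):
        raise ValueError(
            f"EXTERNAL/LOCATION is not applicable to provider '{provider}'")


def treated_as_external(external: bool, location: Optional[str]) -> bool:
    """Built-in behavior as described in the thread: LOCATION implies external."""
    return external or location is not None
```

For example, `validate_create_table("jdbc", True, None)` raises, while `validate_create_table("parquet", True, None)` passes and leaves the choice of directory to the catalog.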
