spark-dev mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: [DISCUSS] PostgreSQL dialect
Date Tue, 26 Nov 2019 23:51:52 GMT
Yea, +1, that looks pretty reasonable to me.
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under PostgreSQL dialect:
I personally think we could at least stop work on the dialect until 3.0 is
released.


On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
gengliang.wang@databricks.com> wrote:

> +1 with the practical proposal.
> To me, the major concern is that the code base becomes complicated, while
> the PostgreSQL dialect has very limited features. I tried introducing one
> big flag `spark.sql.dialect` and isolating related code in #25697
> <https://github.com/apache/spark/pull/25697>, but it seems hard to keep the
> code clean.
> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
> mode, which can be confusing sometimes.
>
> Gengliang
>
> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lixiao@databricks.com> wrote:
>
>> +1
>>
>>
>>> One particular negative effect has been that new postgresql tests add
>>> well over an hour to tests,
>>
>>
>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>> We should continue to do this by importing more test cases. The quality of
>> Spark highly depends on the test coverage. We can further parallelize the test
>> execution to reduce the test time.
>>
>>> Migrating PostgreSQL workloads to Spark SQL
>>
>>
>> This should not be our current focus. In the near future, it is
>> impossible to be fully compatible with PostgreSQL. We should focus on
>> adding features that are useful to the Spark community. PostgreSQL is a good
>> reference, but we do not need to blindly follow it. We already closed
>> multiple related JIRAs that try to add some PostgreSQL features that are
>> not commonly used.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>> mszymkiewicz@gmail.com> wrote:
>>
>>> I think it is important to distinguish between two different concepts:
>>>
>>>    - Adherence to standards and their well established implementations.
>>>    - Enabling migrations from some product X to Spark.
>>>
>>> While these two problems are related, they are independent and one can
>>> be achieved without the other.
>>>
>>>    - The former approach doesn't imply that all features of the SQL
>>>    standard (or its specific implementation) are provided. It is sufficient
>>>    that the commonly used features that are implemented are standard compliant.
>>>    Therefore, if an end user applies some well-known pattern, things will work
>>>    as expected.
>>>
>>>    In my personal opinion that's something that is worth the required
>>>    development resources, and in general should happen within the project.
>>>
>>>
>>>    - The latter one is more complicated. First of all the premise that
>>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>>>    both Spark and PostgreSQL evolve, and probably have more in common today
>>>    than a few years ago, they're not even close enough to pretend that one can
>>>    be a replacement for the other. In contrast, existing compatibility layers
>>>    between major vendors make sense, because feature disparity (at
>>>    least when it comes to core functionality) is usually minimal. And that
>>>    doesn't even touch the problem that PostgreSQL provides extensively used
>>>    extension points that enable a broad and evolving ecosystem (what should we
>>>    do about continuous queries? Should Structured Streaming provide some
>>>    compatibility layer as well?).
>>>
>>>    More realistically Spark could provide a compatibility layer with
>>>    some analytical tools that themselves provide some PostgreSQL compatibility,
>>>    but these are not always fully compatible with upstream PostgreSQL, nor
>>>    necessarily follow the latest PostgreSQL development.
>>>
>>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>>    availability of required primitives), maintained as a separate project,
>>>    without putting more strain on existing resources. Effectively, what we care
>>>    about here is whether we can translate a given SQL string into a logical or
>>>    physical plan.
>>>
>>>
>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>
>>> Hi all,
>>>
>>> Recently we started an effort to achieve feature parity between Spark and
>>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>
>>> This is going very well. We've added many missing features (parser rules,
>>> built-in functions, etc.) to Spark, and also corrected several
>>> inappropriate behaviors of Spark to follow SQL standard and PostgreSQL.
>>> Many thanks to all the people who contributed to it!
>>>
>>> There are several cases when adding a PostgreSQL feature:
>>> 1. Spark doesn't have this feature: just add it.
>>> 2. Spark has this feature, but the behavior is different:
>>>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>>     2.2 Spark's behavior makes sense but violates SQL standard: change
>>> the behavior to follow SQL standard and PostgreSQL, when the ansi mode is
>>> enabled (default false).
>>>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
>>> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
>>> native dialect).
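
As a rough illustration of the case analysis above (not code from Spark), the triage could be sketched in Python as follows. The class `BehaviorDifference`, its field names, and the function `triage` are hypothetical names invented for this sketch; the returned strings only paraphrase cases 1, 2.1, 2.2 and 2.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorDifference:
    """Hypothetical description of one PostgreSQL feature being considered."""
    spark_has_feature: bool
    spark_behavior_makes_sense: bool = True
    violates_sql_standard: bool = False


def triage(diff: BehaviorDifference) -> str:
    """Map a feature to one of the cases listed in the message."""
    if not diff.spark_has_feature:
        # Case 1: the feature is missing, so it is simply added.
        return "add the feature"
    if not diff.spark_behavior_makes_sense:
        # Case 2.1: change the behavior, keep a legacy config to restore it.
        return "change behavior, with legacy config"
    if diff.violates_sql_standard:
        # Case 2.2: follow the standard only when ANSI mode is enabled.
        return "change behavior under ANSI mode"
    # Case 2.3: sensible and standard-compliant, so it goes under the dialect.
    return "add under PostgreSQL dialect"
```

Only the last branch needs a PostgreSQL dialect at all, which is why the number of dialect-only features stays small.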
>>>
>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>> too. For example, DB2 provides an Oracle dialect
>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>
>>> .
>>>
>>> However, there are so many differences between Spark and PostgreSQL,
>>> including SQL parsing, type coercion, function/operator behavior, data
>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>> the Spark codebase pretty complicated, but still not be able to provide a
>>> usable PostgreSQL dialect.
>>>
>>> Furthermore, it's not clear to me how many users have the requirement of
>>> migrating PostgreSQL workloads. I think it's much more important to make
>>> Spark ANSI-compliant first, which doesn't need that much work.
>>>
>>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
>>> our own cast function is not ANSI-compliant yet. This makes me think that
>>> we should do something to properly prioritize ANSI mode over other dialects.
>>>
>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> from the codebase before it's too late. Currently we only have 3 features
>>> under PostgreSQL dialect:
>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
>>> allowed as true strings.
>>> 2. `date - date` returns interval in Spark (SQL standard behavior), but
>>> returns int in PostgreSQL.
>>> 3. `int / int` returns double in Spark, but returns int in PostgreSQL
>>> (there is no standard).
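
To make the three differences concrete, here is a minimal plain-Python model of them. It does not call Spark or PostgreSQL: the function names and the `dialect` parameter are hypothetical, `timedelta` merely stands in for a SQL interval, and the whitespace and case normalization in the boolean check is an assumption. The message elides the full list of accepted true strings, so only the ones it names are modeled. Truncation toward zero for integer division follows PostgreSQL's documented behavior.

```python
from datetime import date, timedelta


def pg_style_is_true(s: str) -> bool:
    """Accept the true strings named in the message: `t`, `tr`, `tru`, `yes`
    (plus the full word `true`). Other accepted strings are not modeled."""
    s = s.strip().lower()  # assumption: input is trimmed and case-insensitive
    return bool(s) and ("true".startswith(s) or s == "yes")


def subtract_dates(d1: date, d2: date, dialect: str = "spark"):
    """`date - date`: interval-like value by default, day count for PostgreSQL."""
    delta: timedelta = d1 - d2
    if dialect == "postgresql":
        return delta.days  # plain int number of days
    return delta  # timedelta stands in for the SQL interval


def divide_ints(a: int, b: int, dialect: str = "spark"):
    """`int / int`: floating-point result by default, int for PostgreSQL."""
    if dialect == "postgresql":
        # Integer division truncating toward zero (Python's // floors instead).
        q = abs(a) // abs(b)
        return q if (a >= 0) == (b >= 0) else -q
    return a / b  # Python float stands in for double
```

For example, `divide_ints(7, 2)` gives `3.5`, while `divide_ints(7, 2, "postgresql")` gives `3`.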
>>>
>>> We should still add PostgreSQL features that Spark doesn't have, or where
>>> Spark's behavior violates the SQL standard. But for others, let's just update
>>> the answer files of PostgreSQL tests.
>>>
>>> Any comments are welcome!
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>
>>
>

-- 
---
Takeshi Yamamuro
