spark-dev mailing list archives

From Yuanjian Li
Subject Re: [DISCUSS] PostgreSQL dialect
Date Thu, 05 Dec 2019 00:56:45 GMT
Thanks all of you for joining the discussion.
The PR has been opened, and all the PostgreSQL dialect related PRs are
linked in its description.
I hope the authors can help with reviewing.


On Sun, Dec 1, 2019 at 7:24 PM Driesprong, Fokko wrote:

> +1 (non-binding)
> Cheers, Fokko
> On Thu, Nov 28, 2019 at 03:47 Dongjoon Hyun wrote:
>> +1
>> Bests,
>> Dongjoon.
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro wrote:
>>> Yea, +1, that looks pretty reasonable to me.
>>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> > from the codebase before it's too late. Currently we only have 3 features
>>> > under PostgreSQL dialect:
>>> I personally think we could at least stop work on the dialect until
>>> 3.0 is released.
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang wrote:
>>>> +1 with the practical proposal.
>>>> To me, the major concern is that the code base becomes complicated,
>>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>>> one big flag `spark.sql.dialect` and isolating related code in #25697,
>>>> but it seems hard to keep clean.
>>>> Furthermore, the PostgreSQL dialect configuration overlaps with the
>>>> ANSI mode, which can be confusing sometimes.
>>>> Gengliang
>>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li wrote:
>>>>> +1
>>>>>> One particular negative effect has been that new postgresql tests
>>>>>> add well over an hour to tests,
>>>>> Adding postgresql tests is for improving the test coverage of Spark
>>>>> SQL. We should continue to do this by importing more test cases. The
>>>>> quality of Spark highly depends on the test coverage. We can further
>>>>> parallelize the test execution to reduce the test time.
>>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>> This should not be our current focus. In the near future, it is
>>>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>>>> adding features that are useful to the Spark community. PostgreSQL is a
>>>>> good reference, but we do not need to blindly follow it. We already closed
>>>>> multiple related JIRAs that try to add some PostgreSQL features that
>>>>> are not commonly used.
>>>>> Cheers,
>>>>> Xiao
>>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz wrote:
>>>>>> I think it is important to distinguish between two different concepts:
>>>>>>    - Adherence to standards and their well established
>>>>>>    implementations.
>>>>>>    - Enabling migrations from some product X to Spark.
>>>>>> While these two problems are related, they are independent, and one
>>>>>> can be achieved without the other.
>>>>>>    - The former approach doesn't imply that all features of the SQL
>>>>>>    standard (or its specific implementation) are provided. It is
>>>>>>    sufficient that commonly used features that are implemented are
>>>>>>    standard compliant. Therefore, if an end user applies some well
>>>>>>    known pattern, things will work as expected.
>>>>>>    In my personal opinion that's something that is worth the
>>>>>>    required development resources, and in general should happen within
>>>>>>    the project.
>>>>>>    - The latter one is more complicated. First of all, the premise
>>>>>>    that one can "migrate PostgreSQL workloads to Spark" seems to be
>>>>>>    flawed. While both Spark and PostgreSQL evolve, and probably have
>>>>>>    more in common today than a few years ago, they're not even close
>>>>>>    enough to pretend that one can be a replacement for the other. In
>>>>>>    contrast, existing compatibility layers between major vendors make
>>>>>>    sense, because feature disparity (at least when it comes to core
>>>>>>    functionality) is usually minimal. And that doesn't even touch the
>>>>>>    problem that PostgreSQL provides extensively used extension points
>>>>>>    that enable a broad and evolving ecosystem (what should we do about
>>>>>>    continuous queries? Should Structured Streaming provide some
>>>>>>    compatibility layer as well?).
>>>>>>    More realistically, Spark could provide a compatibility layer with
>>>>>>    some analytical tools that themselves provide some PostgreSQL
>>>>>>    compatibility, but these are not always fully compatible with
>>>>>>    upstream PostgreSQL, nor do they necessarily follow the latest
>>>>>>    PostgreSQL development.
>>>>>>    Furthermore, a compatibility layer can be, within certain limits
>>>>>>    (i.e. availability of required primitives), maintained as a separate
>>>>>>    project, without putting more strain on existing resources.
>>>>>>    Effectively, what we care about here is whether we can translate a
>>>>>>    certain SQL string into a logical or physical plan.
>>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>> Hi all,
>>>>>> Recently we started an effort to achieve feature parity between Spark
>>>>>> and PostgreSQL:
>>>>>> This goes very well. We've added many missing features (parser rules,
>>>>>> built-in functions, etc.) to Spark, and also corrected several
>>>>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>>>>> Many thanks to all the people who contributed to it!
>>>>>> There are several cases when adding a PostgreSQL feature:
>>>>>> 1. Spark doesn't have this feature: just add it.
>>>>>> 2. Spark has this feature, but the behavior is different:
>>>>>>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>>>>>     2.2 Spark's behavior makes sense but violates the SQL standard:
>>>>>> change the behavior to follow the SQL standard and PostgreSQL when the
>>>>>> ANSI mode is enabled (default false).
>>>>>>     2.3 Spark's behavior makes sense and doesn't violate the SQL
>>>>>> standard: add the PostgreSQL behavior under the PostgreSQL dialect
>>>>>> (default is the Spark native dialect).
>>>>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>>>>> migrate PostgreSQL workloads to Spark. Other databases have this
>>>>>> too. For example, DB2 provides an Oracle dialect.
>>>>>> However, there are so many differences between Spark and PostgreSQL,
>>>>>> including SQL parsing, type coercion, function/operator behavior,
>>>>>> data types, etc. I'm afraid that we may spend a lot of effort on it and
>>>>>> make the Spark codebase pretty complicated, but still not be able to
>>>>>> provide a usable PostgreSQL dialect.
>>>>>> Furthermore, it's not clear to me how many users have the requirement
>>>>>> of migrating PostgreSQL workloads. I think it's much more important to
>>>>>> make Spark ANSI-compliant first, which doesn't need that much work.
>>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>>>> while our own cast function is not ANSI-compliant yet. This makes me
>>>>>> think that we should do something to properly prioritize ANSI mode over
>>>>>> other dialects.
>>>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove
>>>>>> it from the codebase before it's too late. Currently we only have 3
>>>>>> features under PostgreSQL dialect:
>>>>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
>>>>>> also allowed as true strings.
>>>>>> 2. `date - date` returns interval in Spark (SQL standard behavior),
>>>>>> but returns int in PostgreSQL.
>>>>>> 3. `int / int` returns double in Spark, but returns int in
>>>>>> PostgreSQL (there is no standard).
>>>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>>>> where Spark's behavior violates the SQL standard. But for others, let's
>>>>>> just update the answer files of PostgreSQL tests.
>>>>>> Any comments are welcome!
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej
>>> --
>>> ---
>>> Takeshi Yamamuro
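
The three behaviors listed under the PostgreSQL dialect in Wenchen's message can be illustrated with a small plain-Python sketch. This is not Spark code: the function names and the `pg_dialect` flag are hypothetical, and the handling of false strings (prefixes of `false` and `no`) is an assumption that mirrors the true-string rule quoted in the thread. Python's `timedelta` stands in for a SQL interval and `float` for a double.

```python
from datetime import date, timedelta
from typing import Optional, Union


def string_to_bool_pg(text: str) -> Optional[bool]:
    """Hypothetical PostgreSQL-style cast: prefixes such as t, tr, tru, yes count as true."""
    s = text.strip().lower()
    if not s:
        return None
    if "true".startswith(s) or "yes".startswith(s):
        return True
    # Assumption: false strings follow the same prefix rule.
    if "false".startswith(s) or "no".startswith(s):
        return False
    return None  # not a recognized boolean string


def date_minus_date(left: date, right: date, pg_dialect: bool) -> Union[int, timedelta]:
    """Interval-like result by default; a whole number of days under the dialect."""
    delta = left - right
    return delta.days if pg_dialect else delta


def int_divide(a: int, b: int, pg_dialect: bool) -> Union[int, float]:
    """Fractional result by default; integer result truncated toward zero under the dialect."""
    if not pg_dialect:
        return a / b
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


if __name__ == "__main__":
    print(string_to_bool_pg("tru"))
    print(date_minus_date(date(2019, 12, 5), date(2019, 11, 26), pg_dialect=True))
    print(int_divide(7, 2, pg_dialect=False), int_divide(7, 2, pg_dialect=True))
```

Each function changes only its result type or accepted inputs depending on the flag, which suggests how small the dialect's surface was at the time of this discussion.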
