spark-dev mailing list archives

From Takeshi Yamamuro
Subject Re: [DISCUSS] PostgreSQL dialect
Date Tue, 26 Nov 2019 23:51:52 GMT
Yea, +1, that looks pretty reasonable to me.
> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under PostgreSQL dialect:
I personally think we could at least stop work on the dialect until 3.0.

On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang wrote:

> +1 with the practical proposal.
> To me, the major concern is that the code base becomes complicated, while
> the PostgreSQL dialect has very limited features. I tried introducing one
> big flag `spark.sql.dialect` and isolating related code in #25697, but it
> seems hard to keep clean.
> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
> mode, which can be confusing sometimes.
> Gengliang
> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li wrote:
>> +1
>>> One particular negative effect has been that new postgresql tests add
>>> well over an hour to tests,
>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>> We should continue to do this by importing more test cases. The quality of
>> Spark highly depends on the test coverage. We can further parallelize the test
>> execution to reduce the test time.
>>> Migrating PostgreSQL workloads to Spark SQL
>> This should not be our current focus. In the near future, it is
>> impossible to be fully compatible with PostgreSQL. We should focus on
>> adding features that are useful to the Spark community. PostgreSQL is a good
>> reference, but we do not need to blindly follow it. We already closed
>> multiple related JIRAs that try to add some PostgreSQL features that are
>> not commonly used.
>> Cheers,
>> Xiao
>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz wrote:
>>> I think it is important to distinguish between two different concepts:
>>>    - Adherence to standards and their well established implementations.
>>>    - Enabling migrations from some product X to Spark.
>>> While these two problems are related, they are independent and one can
>>> be achieved without the other.
>>>    - The former approach doesn't imply that all features of the SQL
>>>    standard (or its specific implementation) are provided. It is sufficient
>>>    that the commonly used features that are implemented are standard compliant.
>>>    Therefore, if an end user applies some well-known pattern, things will work
>>>    as expected.
>>>    In my personal opinion that's something that is worth the required
>>>    development resources, and in general should happen within the project.
>>>    - The latter one is more complicated. First of all, the premise that
>>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>>>    both Spark and PostgreSQL evolve, and probably have more in common today
>>>    than a few years ago, they're not even close enough to pretend that one can
>>>    be a replacement for the other. In contrast, existing compatibility layers
>>>    between major vendors make sense, because feature disparity (at
>>>    least when it comes to core functionality) is usually minimal. And that
>>>    doesn't even touch the problem that PostgreSQL provides extensively used
>>>    extension points that enable a broad and evolving ecosystem (what should we
>>>    do about continuous queries? Should Structured Streaming provide some
>>>    compatibility layer as well?).
>>>    More realistically, Spark could provide a compatibility layer with
>>>    some analytical tools that themselves provide some PostgreSQL compatibility,
>>>    but these are not always fully compatible with upstream PostgreSQL, nor do
>>>    they necessarily follow the latest PostgreSQL development.
>>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>>    availability of required primitives), maintained as a separate project,
>>>    without putting more strain on existing resources. Effectively, what we care
>>>    about here is whether we can translate a certain SQL string into a logical
>>>    or physical plan.
>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>> Hi all,
>>> Recently we started an effort to achieve feature parity between Spark and
>>> PostgreSQL:
>>> This is going very well. We've added many missing features (parser rules,
>>> built-in functions, etc.) to Spark, and also corrected several
>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>> Many thanks to all the people who have contributed to it!
>>> There are several cases when adding a PostgreSQL feature:
>>> 1. Spark doesn't have this feature: just add it.
>>> 2. Spark has this feature, but the behavior is different:
>>>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>>     2.2 Spark's behavior makes sense but violates SQL standard: change
>>> the behavior to follow SQL standard and PostgreSQL, when ANSI mode is
>>> enabled (default false).
>>>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
>>> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
>>> native dialect).
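
As a rough illustration of the triage above, the cases can be written as a small decision function. This is only a sketch in plain Python: the `triage` function, its parameters, and the `Action` labels are hypothetical names chosen for this example and do not correspond to any Spark API or config.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the four outcomes described in the list above.
    ADD_FEATURE = "case 1: add the missing feature"
    CHANGE_WITH_LEGACY_CONFIG = "case 2.1: change default, legacy config restores old behavior"
    CHANGE_UNDER_ANSI_MODE = "case 2.2: follow the standard only when ANSI mode is enabled"
    POSTGRESQL_DIALECT_ONLY = "case 2.3: PostgreSQL behavior only under the PostgreSQL dialect"


def triage(spark_has_feature, spark_behavior_makes_sense=True,
           violates_sql_standard=False):
    """Map a PostgreSQL feature onto one of the cases from the list above.

    The last two arguments only matter when Spark already has the feature
    and its behavior differs from PostgreSQL (case 2).
    """
    if not spark_has_feature:
        return Action.ADD_FEATURE
    if not spark_behavior_makes_sense:
        return Action.CHANGE_WITH_LEGACY_CONFIG
    if violates_sql_standard:
        return Action.CHANGE_UNDER_ANSI_MODE
    return Action.POSTGRESQL_DIALECT_ONLY
```

Only the final branch requires a separate dialect; the other three are handled by new features, legacy configs, or ANSI mode.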
>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>> too. For example, DB2 provides an Oracle dialect.
>>> However, there are so many differences between Spark and PostgreSQL,
>>> including SQL parsing, type coercion, function/operator behavior, data
>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>> the Spark codebase pretty complicated, but still not be able to provide a
>>> usable PostgreSQL dialect.
>>> Furthermore, it's not clear to me how many users have the requirement of
>>> migrating PostgreSQL workloads. I think it's much more important to make
>>> Spark ANSI-compliant first, which doesn't need that much work.
>>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
>>> our own cast function is not ANSI-compliant yet. This makes me think that
>>> we should do something to properly prioritize ANSI mode over other dialects.
>>> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it
>>> from the codebase before it's too late. Currently we only have 3 features
>>> under PostgreSQL dialect:
>>> 1. When casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
>>> accepted as true strings.
>>> 2. `date - date` returns interval in Spark (SQL standard behavior), but
>>> returns int in PostgreSQL.
>>> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
>>> (there is no standard)
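
To make the three differences above concrete, here is a plain-Python sketch that does not use Spark at all. All helper names are hypothetical. The boolean parsing assumes PostgreSQL's documented rule of accepting unique prefixes of `true`/`false`, `yes`/`no`, `on`/`off` plus `1`/`0`; Python's `timedelta` stands in for an interval; and the integer division assumes truncation toward zero. It is not the code from any of the PRs discussed here.

```python
from datetime import date, timedelta

_TRUE_WORDS = ("true", "yes", "on", "1")
_FALSE_WORDS = ("false", "no", "off", "0")


def pg_style_str_to_bool(text):
    """Parse a boolean the PostgreSQL way: unique prefixes are accepted."""
    s = text.strip().lower()
    true_hit = any(word.startswith(s) for word in _TRUE_WORDS)
    false_hit = any(word.startswith(s) for word in _FALSE_WORDS)
    # Reject input matching neither side, or both (e.g. "o" or "").
    if true_hit == false_hit:
        raise ValueError(f"invalid input for boolean: {text!r}")
    return true_hit


def date_minus_date_interval(a, b):
    """Interval-style result (timedelta is a stand-in for a SQL interval)."""
    return a - b


def date_minus_date_days(a, b):
    """PostgreSQL-style result: an integer number of days."""
    return (a - b).days


def int_div_double(a, b):
    """Spark-style `int / int`: fractional result."""
    return a / b


def int_div_truncating(a, b):
    """PostgreSQL-style `int / int`: integer result, truncated toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q
```

For example, `pg_style_str_to_bool("tru")` is accepted as true while `pg_style_str_to_bool("o")` is rejected as ambiguous, and `int_div_double(7, 2)` gives `3.5` where `int_div_truncating(7, 2)` gives `3`.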
>>> We should still add PostgreSQL features that Spark doesn't have, or where
>>> Spark's behavior violates the SQL standard. But for others, let's just update
>>> the answer files of PostgreSQL tests.
>>> Any comments are welcome!
>>> Thanks,
>>> Wenchen
>>> --
>>> Best regards,
>>> Maciej

Takeshi Yamamuro
