spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: [DISCUSS] PostgreSQL dialect
Date Thu, 28 Nov 2019 02:47:36 GMT
+1

Bests,
Dongjoon.

On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin.m.s@gmail.com>
wrote:

> Yea, +1, that looks pretty reasonable to me.
>
> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> > from the codebase before it's too late. Currently we only have 3 features
> > under PostgreSQL dialect:
>
> I personally think we could at least stop work on the dialect until 3.0
> is released.
>
>
> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
> gengliang.wang@databricks.com> wrote:
>
>> +1 with the practical proposal.
>> To me, the major concern is that the code base becomes complicated, while
>> the PostgreSQL dialect has very limited features. I tried introducing one
>> big flag `spark.sql.dialect` and isolating related code in #25697
>> <https://github.com/apache/spark/pull/25697>, but it seems hard to be
>> clean.
>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>> mode, which can be confusing sometimes.
>>
>> Gengliang
>>
>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lixiao@databricks.com> wrote:
>>
>>> +1
>>>
>>>
>>>> One particular negative effect has been that new postgresql tests add
>>>> well over an hour to tests,
>>>
>>>
>>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>>> We should continue to do this by importing more test cases. The quality of
>>> Spark highly depends on the test coverage. We can further parallelize the test
>>> execution to reduce the test time.
>>>
>>>> Migrating PostgreSQL workloads to Spark SQL
>>>
>>>
>>> This should not be our current focus. Full compatibility with PostgreSQL
>>> is not achievable in the near future. We should focus on
>>> adding features that are useful to the Spark community. PostgreSQL is a good
>>> reference, but we do not need to blindly follow it. We already closed
>>> multiple related JIRAs that try to add some PostgreSQL features that are
>>> not commonly used.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>> mszymkiewicz@gmail.com> wrote:
>>>
>>>> I think it is important to distinguish between two different concepts:
>>>>
>>>>    - Adherence to standards and their well established implementations.
>>>>    - Enabling migrations from some product X to Spark.
>>>>
>>>> While these two problems are related, they are independent and one can
>>>> be achieved without the other.
>>>>
>>>>    - The former approach doesn't imply that all features of the SQL
>>>>    standard (or its specific implementation) are provided. It is sufficient
>>>>    that commonly used features that are implemented are standard compliant.
>>>>    Therefore, if an end user applies some well-known pattern, things will
>>>>    work as expected.
>>>>
>>>>    In my personal opinion that's something that is worth the required
>>>>    development resources, and in general should happen within the project.
>>>>
>>>>
>>>>    - The latter one is more complicated. First of all, the premise that
>>>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>>>>    both Spark and PostgreSQL evolve, and probably have more in common today
>>>>    than a few years ago, they're not even close enough to pretend that one
>>>>    can be a replacement for the other. In contrast, existing compatibility
>>>>    layers between major vendors make sense, because feature disparity (at
>>>>    least when it comes to core functionality) is usually minimal. And that
>>>>    doesn't even touch the problem that PostgreSQL provides extensively used
>>>>    extension points that enable a broad and evolving ecosystem (what should
>>>>    we do about continuous queries? Should Structured Streaming provide some
>>>>    compatibility layer as well?).
>>>>
>>>>    More realistically, Spark could provide a compatibility layer with
>>>>    some analytical tools that themselves provide some PostgreSQL
>>>>    compatibility, but these are not always fully compatible with upstream
>>>>    PostgreSQL, nor do they necessarily follow the latest PostgreSQL development.
>>>>
>>>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>>>    availability of required primitives), maintained as a separate project,
>>>>    without putting more strain on existing resources. Effectively, what we
>>>>    care about here is whether we can translate a given SQL string into a
>>>>    logical or physical plan.
>>>>
>>>>
>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Recently we started an effort to achieve feature parity between Spark and
>>>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>
>>>> This is going very well. We've added many missing features (parser rules,
>>>> built-in functions, etc.) to Spark, and also corrected several
>>>> inappropriate behaviors of Spark to follow SQL standard and PostgreSQL.
>>>> Many thanks to all the people that contribute to it!
>>>>
>>>> There are several cases when adding a PostgreSQL feature:
>>>> 1. Spark doesn't have this feature: just add it.
>>>> 2. Spark has this feature, but the behavior is different:
>>>>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
>>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>>>     2.2 Spark's behavior makes sense but violates SQL standard: change
>>>> the behavior to follow SQL standard and PostgreSQL, when the ansi mode is
>>>> enabled (default false).
>>>>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
>>>> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
>>>> native dialect).
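The cases above amount to a small decision procedure. A minimal Python sketch of it follows; it is not Spark code, and the names `Action` and `classify` are hypothetical. Returning `None` when Spark already behaves identically is an assumption, since that case is not listed above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """Outcomes named in the list above (labels are illustrative)."""
    ADD_FEATURE = "1: just add the feature"
    CHANGE_WITH_LEGACY_CONFIG = "2.1: change behavior, keep a legacy config"
    CHANGE_UNDER_ANSI_MODE = "2.2: change behavior only when ANSI mode is on"
    ADD_UNDER_PG_DIALECT = "2.3: add behavior under the PostgreSQL dialect"


def classify(
    spark_has_feature: bool,
    behavior_differs: bool,
    spark_behavior_makes_sense: bool,
    violates_sql_standard: bool,
) -> Optional[Action]:
    """Map a PostgreSQL feature onto one of the cases 1, 2.1, 2.2, 2.3."""
    if not spark_has_feature:
        return Action.ADD_FEATURE
    if not behavior_differs:
        # Assumption: nothing to do when the behaviors already match.
        return None
    if not spark_behavior_makes_sense:
        return Action.CHANGE_WITH_LEGACY_CONFIG
    if violates_sql_standard:
        return Action.CHANGE_UNDER_ANSI_MODE
    return Action.ADD_UNDER_PG_DIALECT
```

Only the last branch (case 2.3) leads to dialect-specific code, which is the part under discussion in this thread.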
>>>>
>>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>>> too. For example, DB2 provides an Oracle dialect
>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>
>>>> .
>>>>
>>>> However, there are so many differences between Spark and PostgreSQL,
>>>> including SQL parsing, type coercion, function/operator behavior, data
>>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>>> the Spark codebase pretty complicated, but still not be able to provide a
>>>> usable PostgreSQL dialect.
>>>>
>>>> Furthermore, it's not clear to me how many users have the requirement
>>>> of migrating PostgreSQL workloads. I think it's much more important to make
>>>> Spark ANSI-compliant first, which doesn't need that much work.
>>>>
>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
>>>> our own cast function is not ANSI-compliant yet. This makes me think that
>>>> we should do something to properly prioritize ANSI mode over other dialects.
>>>>
>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>>> from the codebase before it's too late. Currently we only have 3 features
>>>> under PostgreSQL dialect:
>>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
>>>> allowed as true string.
>>>> 2. `date - date` returns interval in Spark (SQL standard behavior),
>>>> but returns int in PostgreSQL
>>>> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
>>>> (there is no standard)
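The three differences listed above can be illustrated in plain Python. This is only a sketch of the semantics, not Spark or PostgreSQL code, and all function names are hypothetical. Two details are assumptions based on PostgreSQL's documented behavior rather than on this thread: the spellings accepted beyond those named above (`on`, `off`, `1`, `0`, and unique prefixes in general), and integer division truncating toward zero.

```python
from datetime import date, timedelta

# Assumed PostgreSQL boolean spellings; unique prefixes are accepted too.
_TRUE_WORDS = ("true", "yes", "on", "1")
_FALSE_WORDS = ("false", "no", "off", "0")


def pg_style_string_to_bool(text: str) -> bool:
    """Accept any unambiguous prefix of a true/false word (t, tr, tru, ...)."""
    s = text.strip().lower()
    if not s:
        raise ValueError("empty string is not a boolean")
    is_true = any(w.startswith(s) for w in _TRUE_WORDS)
    is_false = any(w.startswith(s) for w in _FALSE_WORDS)
    if is_true == is_false:
        # Either no match at all, or an ambiguous prefix such as "o".
        raise ValueError(f"invalid boolean string: {text!r}")
    return is_true


def date_minus_date_as_interval(a: date, b: date) -> timedelta:
    """Interval-style result, as described for Spark above."""
    return a - b


def date_minus_date_as_int(a: date, b: date) -> int:
    """Integer number of days, as described for PostgreSQL above."""
    return (a - b).days


def int_div_as_double(a: int, b: int) -> float:
    """Fractional result, as described for Spark above."""
    return a / b


def int_div_truncating(a: int, b: int) -> int:
    """Integer result truncated toward zero (assumed PostgreSQL rounding)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q
```

For example, `pg_style_string_to_bool("tru")` is `True`, `date_minus_date_as_int(date(2019, 11, 28), date(2019, 11, 26))` is `2`, and `int_div_truncating(7, 2)` is `3` while `int_div_as_double(7, 2)` is `3.5`.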
>>>>
>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>> where Spark's behavior violates the SQL standard. But for others, let's just update
>>>> the answer files of PostgreSQL tests.
>>>>
>>>> Any comments are welcome!
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej
>>>>
>>>>
>>>
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
