spark-dev mailing list archives

From Yuanjian Li <xyliyuanj...@gmail.com>
Subject Re: [DISCUSS] PostgreSQL dialect
Date Thu, 05 Dec 2019 00:56:45 GMT
Thanks all of you for joining the discussion.
The PR is at https://github.com/apache/spark/pull/26763; all the
PostgreSQL dialect related PRs are linked in its description.
I hope the authors can help with the review.

Best,
Yuanjian

Driesprong, Fokko <fokko@driesprong.frl> wrote on Sun, Dec 1, 2019 at 7:24 PM:

> +1 (non-binding)
>
> Cheers, Fokko
>
> On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro <linguin.m.s@gmail.com>
>> wrote:
>>
>>> Yea, +1, that looks pretty reasonable to me.
>>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> > from the codebase before it's too late. Currently we only have 3 features
>>> > under PostgreSQL dialect:
>>> I personally think we could at least stop work on the dialect until
>>> 3.0 is released.
>>>
>>>
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>>> gengliang.wang@databricks.com> wrote:
>>>
>>>> +1 with the practical proposal.
>>>> To me, the major concern is that the code base becomes complicated,
>>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>>> one big flag `spark.sql.dialect` and isolating related code in #25697
>>>> <https://github.com/apache/spark/pull/25697>, but it seems hard to
>>>> keep it clean.
>>>> Furthermore, the PostgreSQL dialect configuration overlaps with the
>>>> ANSI mode, which can be confusing sometimes.
>>>>
>>>> Gengliang
>>>>
>>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li <lixiao@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>> One particular negative effect has been that new postgresql tests add
>>>>>> well over an hour to tests,
>>>>>
>>>>>
>>>>> Adding postgresql tests is for improving the test coverage of Spark
>>>>> SQL. We should continue to do this by importing more test cases. The
>>>>> quality of Spark highly depends on the test coverage. We can further
>>>>> parallelize the test execution to reduce the test time.
>>>>>
>>>>>> Migrating PostgreSQL workloads to Spark SQL
>>>>>
>>>>>
>>>>> This should not be our current focus. In the near future, it is
>>>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>>>> adding features that are useful to the Spark community. PostgreSQL is a good
>>>>> reference, but we do not need to blindly follow it. We already closed
>>>>> multiple related JIRAs that try to add some PostgreSQL features that are
>>>>> not commonly used.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>>>> mszymkiewicz@gmail.com> wrote:
>>>>>
>>>>>> I think it is important to distinguish between two different concepts:
>>>>>>
>>>>>>    - Adherence to standards and their well established
>>>>>>    implementations.
>>>>>>    - Enabling migrations from some product X to Spark.
>>>>>>
>>>>>> While these two problems are related, they are independent and one
>>>>>> can be achieved without the other.
>>>>>>
>>>>>>    - The former approach doesn't imply that all features of the SQL
>>>>>>    standard (or its specific implementation) are provided. It is
>>>>>>    sufficient that the commonly used features that are implemented are
>>>>>>    standard compliant. Therefore, if an end user applies some well-known
>>>>>>    pattern, things will work as expected.
>>>>>>
>>>>>>    In my personal opinion that's something that is worth the
>>>>>>    required development resources, and in general should happen within
>>>>>>    the project.
>>>>>>
>>>>>>
>>>>>>    - The latter one is more complicated. First of all, the premise
>>>>>>    that one can "migrate PostgreSQL workloads to Spark" seems to be
>>>>>>    flawed. While both Spark and PostgreSQL evolve, and probably have more
>>>>>>    in common today than a few years ago, they're not even close enough to
>>>>>>    pretend that one can be a replacement for the other. In contrast,
>>>>>>    existing compatibility layers between major vendors make sense, because
>>>>>>    feature disparity (at least when it comes to core functionality) is
>>>>>>    usually minimal. And that doesn't even touch the problem that
>>>>>>    PostgreSQL provides extensively used extension points that enable a
>>>>>>    broad and evolving ecosystem (what should we do about continuous
>>>>>>    queries? Should Structured Streaming provide some compatibility layer
>>>>>>    as well?).
>>>>>>
>>>>>>    More realistically, Spark could provide a compatibility layer with
>>>>>>    some analytical tools that themselves provide some PostgreSQL
>>>>>>    compatibility, but these are not always fully compatible with upstream
>>>>>>    PostgreSQL, nor do they necessarily follow the latest PostgreSQL
>>>>>>    development.
>>>>>>
>>>>>>    Furthermore, a compatibility layer can be, within certain limits
>>>>>>    (i.e. availability of required primitives), maintained as a separate
>>>>>>    project, without putting more strain on existing resources. Effectively,
>>>>>>    what we care about here is whether we can translate a certain SQL string
>>>>>>    into a logical or physical plan.
>>>>>>
>>>>>>
>>>>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently we started an effort to achieve feature parity between Spark
>>>>>> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>>>>
>>>>>> This goes very well. We've added many missing features (parser rules,
>>>>>> built-in functions, etc.) to Spark, and also corrected several
>>>>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>>>>> Many thanks to all the people who contributed to it!
>>>>>>
>>>>>> There are several cases when adding a PostgreSQL feature:
>>>>>> 1. Spark doesn't have this feature: just add it.
>>>>>> 2. Spark has this feature, but the behavior is different:
>>>>>>     2.1 Spark's behavior doesn't make sense: change it to follow the SQL
>>>>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>>>>>     2.2 Spark's behavior makes sense but violates the SQL standard:
>>>>>> change the behavior to follow the SQL standard and PostgreSQL when the ANSI
>>>>>> mode is enabled (default false).
>>>>>>     2.3 Spark's behavior makes sense and doesn't violate the SQL
>>>>>> standard: add the PostgreSQL behavior under the PostgreSQL dialect
>>>>>> (default is Spark native dialect).
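The case analysis above can be read as a small decision procedure. The sketch below is only an illustration of that list: the names `Action` and `classify` and all parameter names are hypothetical and are not part of Spark, and the `NOTHING_TO_DO` outcome is an assumption added for the situation the list does not mention (Spark already has the feature and behaves the same way).

```python
from enum import Enum


class Action(Enum):
    # Labels mirror the numbered cases in the list above.
    ADD_FEATURE = "1: add the missing feature"
    CHANGE_WITH_LEGACY_CONFIG = "2.1: change behavior, keep a legacy config"
    FOLLOW_STANDARD_IN_ANSI_MODE = "2.2: change behavior only when ANSI mode is on"
    PUT_UNDER_POSTGRESQL_DIALECT = "2.3: add behavior under the PostgreSQL dialect"
    # Assumption: not in the list, added so the function is total.
    NOTHING_TO_DO = "behavior already matches"


def classify(spark_has_feature: bool,
             behavior_differs: bool = False,
             spark_behavior_makes_sense: bool = True,
             violates_sql_standard: bool = False) -> Action:
    """Map the yes/no questions from the list to one of the listed cases."""
    if not spark_has_feature:
        return Action.ADD_FEATURE
    if not behavior_differs:
        return Action.NOTHING_TO_DO
    if not spark_behavior_makes_sense:
        return Action.CHANGE_WITH_LEGACY_CONFIG
    if violates_sql_standard:
        return Action.FOLLOW_STANDARD_IN_ANSI_MODE
    return Action.PUT_UNDER_POSTGRESQL_DIALECT


# A behavior that differs from PostgreSQL, is sensible, and breaks no
# standard lands in case 2.3, the only case that needs the dialect.
print(classify(spark_has_feature=True, behavior_differs=True).value)
```

Only the last branch depends on a PostgreSQL dialect; the other branches are handled by plain feature work, legacy configs, or ANSI mode.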
>>>>>>
>>>>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>>>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>>>>> too. For example, DB2 provides an Oracle dialect
>>>>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.
>>>>>>
>>>>>> However, there are so many differences between Spark and PostgreSQL,
>>>>>> including SQL parsing, type coercion, function/operator behavior, data
>>>>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>>>>> the Spark codebase pretty complicated, but still not be able to provide a
>>>>>> usable PostgreSQL dialect.
>>>>>>
>>>>>> Furthermore, it's not clear to me how many users have the requirement
>>>>>> of migrating PostgreSQL workloads. I think it's much more important to
>>>>>> make Spark ANSI-compliant first, which doesn't need that much work.
>>>>>>
>>>>>> Recently I've seen multiple PRs adding PostgreSQL cast functions,
>>>>>> while our own cast function is not ANSI-compliant yet. This makes me
>>>>>> think that we should do something to properly prioritize ANSI mode over
>>>>>> other dialects.
>>>>>>
>>>>>> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove
>>>>>> it from the codebase before it's too late. Currently we only have 3
>>>>>> features under the PostgreSQL dialect:
>>>>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
>>>>>> also allowed as true string.
>>>>>> 2. `date - date` returns interval in Spark (SQL standard behavior),
>>>>>> but returns int in PostgreSQL
>>>>>> 3. `int / int` returns double in Spark, but returns int in
>>>>>> PostgreSQL. (there is no standard)
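The three differences above can be modeled in plain Python to make the contrast concrete. This is a rough sketch, not Spark or PostgreSQL code: every function name is hypothetical, the true-string rule is simplified to "any prefix of `true` or `yes`" (an assumption based only on the examples listed), a `timedelta` stands in for an interval, and integer division is assumed to truncate toward zero.

```python
from datetime import date, timedelta


def is_true_string_pg_style(text: str) -> bool:
    # Simplified assumption: accept any non-empty prefix of "true" or "yes",
    # ignoring case and surrounding whitespace.
    token = text.strip().lower()
    return bool(token) and any(w.startswith(token) for w in ("true", "yes"))


def date_minus_date_as_interval(a: date, b: date) -> timedelta:
    # Interval-style result: a duration object, not a number.
    return a - b


def date_minus_date_as_int(a: date, b: date) -> int:
    # Integer-style result: a plain count of days.
    return (a - b).days


def divide_as_double(a: int, b: int) -> float:
    # Double-style result: 7 / 2 keeps the fractional part.
    return a / b


def divide_as_int(a: int, b: int) -> int:
    # Integer-style result, assumed to truncate toward zero.
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


print(is_true_string_pg_style("tr"))
print(date_minus_date_as_interval(date(2019, 12, 5), date(2019, 11, 26)))
print(date_minus_date_as_int(date(2019, 12, 5), date(2019, 11, 26)))
print(divide_as_double(7, 2), divide_as_int(7, 2))
```

In each pair both results are defensible, which is why these three behaviors were placed behind a dialect switch rather than treated as bugs.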
>>>>>>
>>>>>> We should still add PostgreSQL features that Spark doesn't have, or
>>>>>> where Spark's behavior violates the SQL standard. But for others, let's
>>>>>> just update the answer files of PostgreSQL tests.
>>>>>>
>>>>>> Any comments are welcome!
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
