spark-dev mailing list archives

From Imran Rashid <iras...@cloudera.com>
Subject Re: SQL TIMESTAMP semantics vs. SPARK-18350
Date Sat, 27 May 2017 18:56:50 GMT
I had asked zoltan to bring this discussion to the dev list because I think
it's a question that extends beyond a single jira (we can't figure out the
semantics of timestamp in parquet if we don't know the overall goal of the
timestamp type) and since it's a design question the entire community should
be involved.

I think that a lot of the confusion comes because we're talking about
different ways time zones affect behavior: (1) parsing and (2) behavior when
changing time zones for processing data.

It seems we agree that spark should eventually provide a timestamp type
which does conform to the standard.   The question is, how do we get
there?  Has spark already broken compliance so much that it's impossible to
go back without breaking user behavior?  Or perhaps spark already has
inconsistent behavior / broken compatibility within the 2.x line, so it's
not unthinkable to have another breaking change?

(Another part of the confusion is on me -- I believed the behavior change
was in 2.2, but actually it looks like it's in 2.0.1.  That changes how we
think about this in context of what goes into a 2.2 release.  SPARK-18350
isn't the origin of the difference in behavior.)

First: consider processing data that is already stored in tables, and then
accessing it from machines in different time zones.  The standard is clear
that "timestamp" should be just like "timestamp without time zone": it does
not represent one instant in time, rather it's always displayed the same,
regardless of time zone.  This was the behavior in spark 2.0.0 (and 1.6),
 for hive tables stored as text files, and for spark's json formats.
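
To make the distinction concrete, here is a minimal sketch in plain python
(not spark code). The zone offsets and function names are made up for
illustration; it only contrasts the two display behaviors described above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zones, chosen only for illustration (no DST).
WRITER_ZONE = timezone(timedelta(hours=-8))
READER_ZONE = timezone(timedelta(hours=1))
FMT = "%Y-%m-%d %H:%M:%S"


def render_without_tz(wall_clock: datetime, viewer_zone: timezone) -> str:
    """'timestamp without time zone': a wall-clock reading.

    The viewer's zone is deliberately ignored, so every machine
    displays the same fields.
    """
    return wall_clock.strftime(FMT)


def render_instant(instant: datetime, viewer_zone: timezone) -> str:
    """Instant semantics: one point in time, shown in the viewer's zone."""
    return instant.astimezone(viewer_zone).strftime(FMT)


# The writer records "2017-05-27 12:00:00" on a machine in WRITER_ZONE.
wall = datetime(2017, 5, 27, 12, 0, 0)
instant = wall.replace(tzinfo=WRITER_ZONE).astimezone(timezone.utc)

zones = (WRITER_ZONE, READER_ZONE)
without_tz = [render_without_tz(wall, z) for z in zones]
as_instant = [render_instant(instant, z) for z in zones]

print(without_tz)
print(as_instant)
```

With these offsets the wall-clock value reads 12:00:00 on both machines,
while the instant reads 12:00:00 for the writer and 21:00:00 for the reader.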

Spark 2.0.1 changed the behavior of the json format (I believe
with SPARK-16216), so that it behaves more like timestamp *with* time
zone.  It also makes csv behave the same (timestamp in csv was basically
broken in 2.0.0).  However it did *not* change the behavior of a hive
textfile; it still behaves like "timestamp with*out* time zone".  Here are
some experiments I tried -- there are a bunch of files there for
completeness, but mostly focus on the difference between
query_output_2_0_0.txt vs. query_output_2_0_1.txt

https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70

Given that spark has changed this behavior post 2.0.0, is it still out of
the question to change this behavior to bring it back in line with the sql
standard for timestamp (without time zone) in the 2.x line?  Or, as reynold
proposes, is the only option at this point to add an off-by-default feature
flag to get "timestamp without time zone" semantics?


Second, there is the question of parsing strings into timestamp type.   I'm
far less knowledgeable about this, so I mostly just have questions:

* does the standard dictate what the parsing behavior should be for
timestamp (without time zone) when a time zone is present?

* if it does, and spark violates this standard, is it worth trying to retain
the *other* semantics of timestamp without time zone, even if we violate
the parsing part?

I did look at what postgres does for comparison:

https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c

spark's timestamp certainly does not match postgres's timestamp for
parsing, it seems closer to postgres's "timestamp with timezone" -- though
I dunno if that is standard behavior at all.
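
As a rough illustration of the two parsing policies in question, here is a
plain python sketch (not spark or postgres code; the function names and the
session-zone fallback are assumptions made for the example). One policy
discards any zone written in the string, the other applies it and
normalizes the result to UTC.

```python
from datetime import datetime, timezone


def parse_dropping_zone(text: str) -> datetime:
    """Keep the wall-clock fields as written and discard any offset."""
    return datetime.fromisoformat(text).replace(tzinfo=None)


def parse_applying_zone(text: str, session_zone: timezone) -> datetime:
    """Honor the offset in the string and return a UTC instant.

    If the string carries no offset, fall back to the (hypothetical)
    session zone.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=session_zone)
    return parsed.astimezone(timezone.utc)


text = "2017-05-27 12:00:00+02:00"
dropped = parse_dropping_zone(text)
applied = parse_applying_zone(text, timezone.utc)

print(dropped.isoformat())
print(applied.isoformat())
```

For this input the first policy keeps 12:00:00 as written, while the second
yields the instant 10:00:00 UTC.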

thanks,
Imran

On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <rxin@databricks.com> wrote:

> That's just my point 4, isn't it?
>
>
> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.manor@equalum.io> wrote:
>
>> Reynold,
>> my point is that Spark should aim to follow the SQL standard instead of
>> rolling its own type system.
>> If I understand correctly, the existing implementation is similar to
>> the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE
>> data types which are missing from Spark.
>> So, it is better (for me) if instead of extending the existing types,
>> Spark would just implement the additional well-defined types properly.
>> Just trying to copy-paste CREATE TABLE between SQL engines should not be
>> an exercise of flags and incompatibilities.
>>
>> Regarding the current behaviour, if I remember correctly I had to force
>> our spark O/S user into UTC so Spark won't change my timestamps.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
>>
>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Zoltan,
>>>
>>> Thanks for raising this again, although I'm a bit confused since I've
>>> communicated with you a few times on JIRA and on private emails to explain
>>> that you have some misunderstanding of the timestamp type in Spark and some
>>> of your statements are wrong (e.g. the "except for textfile" part). Not sure why
>>> you didn't get any of those.
>>>
>>>
>>> Here's another try:
>>>
>>>
>>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>>> before session local timezone change. IIUC, Spark has always assumed
>>> timestamps to be with timezone, since it parses timestamps with timezone
>>> and does all the datetime conversions with timezone in mind (it doesn't
>>> ignore timezone if a timestamp string has timezone specified). The session
>>> local timezone change further pushes Spark to that direction, but the
>>> semantics has been with timezone before that change. Just run Spark on
>>> machines with different timezone and you will know what I'm talking about.
>>>
>>> 2. CSV/Text is not different. The data type has always been "with
>>> timezone". If you put a timezone in the timestamp string, it parses the
>>> timezone.
>>>
>>> 3. We can't change semantics now, because it'd break all existing Spark
>>> apps.
>>>
>>> 4. We can however introduce a new timestamp without timezone type, and
>>> have a config flag to specify which one (with tz or without tz) is the
>>> default behavior.
>>>
>>>
>>>
>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <zi@cloudera.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sorry if you receive this mail twice, it seems that my first attempt
>>>> did not make it to the list for some reason.
>>>>
>>>> I would like to start a discussion about SPARK-18350
>>>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>>>> released because it seems to be going in a different direction than what
>>>> other SQL engines of the Hadoop stack do.
>>>>
>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>>> expresses readings from calendars and clocks and is unaffected by time
>>>> zone. In the Hadoop stack, Impala has always worked like this and recently
>>>> Presto also took steps <https://github.com/prestodb/presto/issues/7122>
>>>> to become standards compliant. (Presto's design doc
>>>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>>> also contains a great summary of the different semantics.) Hive has a
>>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>>> source of incompatibility that is already being addressed
>>>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>>> type.
>>>>
>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>>> compliance and consistency with most SQL engines, I was wondering whether
>>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>>> adopt these semantics in the future, SPARK-18350
>>>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be
>>>> a source of problems. Please correct me if I'm wrong, but this change seems
>>>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>>>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>>>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>>>> better becoming timezone-agnostic instead of gaining further timezone-aware
>>>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>>>> change, so it must be optional and configurable by the user, as in Presto.)
>>>>
>>>> I would like to hear your opinions about this concern and about
>>>> TIMESTAMP semantics in general. Does the community agree that a
>>>> standards-compliant and interoperable TIMESTAMP type is desired? Do you
>>>> perceive SPARK-18350 as a potential problem in achieving this or do I
>>>> misunderstand the effects of this change?
>>>>
>>>> Thanks,
>>>>
>>>> Zoltan
>>>>
>>>> ---
>>>>
>>>> List of links in case in-line links do not work:
>>>>
>>>>    - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>    - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>    - Presto's design doc:
>>>>      https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>>>>
>>>>
>>>>
>>>
>>
>
