calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: Handling of system attributes in a row
Date Wed, 15 Feb 2017 19:18:33 GMT
The “system columns” feature was working when the Calcite code base was used in a previous
project several years ago, though it may have atrophied since.

In that project, we actually had ROWTIME as a system column. It was partially successful,
but raised some problems. Consider a streaming join. The source relations have ROWTIME columns,
but the join would create an additional ROWTIME column.

(1) Should the original ROWTIME columns just disappear? It wasn’t clear.

(2) What if you wanted to assign your own ROWTIME from an expression. Is it sufficient to
just write ‘expression AS ROWTIME’ in the SELECT clause? This a more profound operation
than simply renaming a column. Being a system column implies certain data type, not null,
and maybe ordering. Would you also have to apply an ORDER BY?

(3) Having system columns in streams conflicts with the goal of using the same SQL for streaming
and historic (regular SQL) queries.

(4) We found ourselves relying on these columns under the covers in certain operators. The
system had become non-relational.

So, after that experience I concluded that ROWTIME should be just a regular column. Just a
column that we know a lot about: it is a timestamp (usually; although we allow other data
types); it is conventionally called ROWTIME but it doesn’t have to be; it is not null; it
is usually a sort column (but it may not be - the stream might be k-sorted, e.g. within 10
rows or 10 minutes of being sorted, or sorted within a particular warehouseId; or “sortable").

The “sortable” concept is really powerful. Consider the Orders stream and the Orders table.
The stream is infinite, so the only sortable columns are those that are already sorted or
are within N rows or T seconds of being sorted, or have some kind of guarantee in terms of
watermarks. The table is finite, so everything is sortable. If we want to do an operation
such as “GROUP BY FLOOR(ROWTIME TO HOUR)” it is sufficient that ROWTIME is sortable. You
do not need ROWTIME to be a system column.

My advice is, rather than requiring “blessed” system columns, you have a convention for
the names of event-time and processing-time columns, and make your operators consume and produce
on those columns explicitly.

Julian



> On Feb 15, 2017, at 12:56 AM, Timo Walther <twalthr@apache.org> wrote:
> 
> Hi everyone,
> 
> we (from Flink) are currently discussing how we can express time-semantics (event-time
or processing-time) in a SQL query. The optimal solution would be to have two system attributes
that are part of every table schema/every row data type. We could then access it like `SELECT
* FROM MyTable ORDER BY rowtime`. However, it should not be part of the result in an expansion
(`*`) and the user should not modify those attributes (no aliasing, read-only). I had a look
into SqlValidator and there are several lines that contain things like `includeSystemVars`
or `isSystemField` but nothing concrete. Am I right that this feature is not entirely implemented
yet? Which parts would you touch/override to implement this feature?
> 
> Thanks in advance.
> 
> Regards,
> Timo
> 
> 
> 


Mime
View raw message