drill-dev mailing list archives

From weijie tong <tongweijie...@gmail.com>
Subject Re: Possible way to specify column types in query
Date Fri, 07 Sep 2018 03:43:39 GMT
Google's latest paper on F1 [1] claims to support any data source through
an extension API called TVF (see section 6.3). It also requires declaring
column data types before the query.


[1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf

On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi All,
>
> We've discussed quite a few times whether Drill should or should not
> support or require schemas, and if so, how the user might express the
> schema.
>
> I came across a paper [1] that suggests a simple, elegant SQL extension:
>
> EXTRACT <column>[:<type>] {,<column>[:<type>]}
> FROM <stream_name>
>
> Paraphrasing into Drill's SQL:
>
> SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> FROM <table_name>
>
> Have a collection of JSON files in which string column `foo` appears in
> only half the files? Don't want to get schema conflicts with VARCHAR and
> nullable INT? Just do:
>
> SELECT name:VARCHAR, age:INT, foo:VARCHAR
> FROM `my-dir` ...
>
> Not only can the syntax be used to specify the "natural" type for a
> column, it might also specify a preferred type. For example, "age:INT" says
> that "age" is an INT, even though JSON would normally parse it as a BIGINT.
> Similarly, using this syntax is an easy way to tell Drill how to convert CSV
> columns from strings to DATE, INT, FLOAT, etc. without the need for CAST
> functions. (CAST functions read the data in one format, then convert it to
> another in a Project operator. Using a column type might let the reader do
> the conversion -- something that is easy to implement if using the "result
> set loader" mechanism.)
>
> Plus, the syntax fits nicely into the existing view file structure. If the
> types appear in views, then client tools can continue to use standard SQL
> without the type information.
>
> When this idea came up in the past, someone mentioned the issue of
> nullable vs. non-nullable. (Let's also include arrays, since Drill supports
> them.) Maybe add a suffix to the name:
>
> SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
>
> Not pretty, but works with the existing SQL syntax rules.
>
> Obviously, Drill has much on its plate, so I'm not suggesting that Drill
> should do this soon. Just passing it along as yet another option to
> consider.
>
> Thanks,
> - Paul
>
> [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
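
For comparison, the CAST-based conversion that the quoted message contrasts
with typed columns might look roughly like this in current Drill SQL. This is
only a sketch: the file path, the column order, and the aliases are
hypothetical and do not come from the thread.

```sql
-- Hypothetical headerless CSV with three fields: name, age, signup date.
-- Drill exposes the fields of a headerless text file through the `columns`
-- array, and every element is read as VARCHAR.
SELECT
  columns[0]               AS name,
  CAST(columns[1] AS INT)  AS age,          -- converted after the read
  CAST(columns[2] AS DATE) AS signup_date   -- converted after the read
FROM dfs.`/data/people.csv`;
```

Here the reader produces strings and the conversion happens afterwards, which
is the extra step the proposed `<column>:<type>` form would let the reader
absorb.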
