drill-dev mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Possible way to specify column types in query
Date Mon, 10 Sep 2018 04:42:11 GMT
Hi Weijie,

Thanks for the paper pointer. F1 uses the same syntax as Scope (the system cited in my earlier
note): data type after the name.

Another description is [1]. Neither paper describes how F1 handles arrays. However, this second
paper points out that Protobuf is F1's native format, and so F1 has support for nested types.
Drill does also, but in Drill, a reference to "customer.phone.cell" causes the nested "cell"
column to be projected as a top-level column. Also, neither paper says whether F1 is used with
ODBC or JDBC, and if so, how it handles the mapping from nested types to the flat tuple
structure required by xDBC.
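
To make the mapping question concrete, here is a minimal Python sketch of projecting nested fields into a flat tuple, roughly what a reference such as "customer.phone.cell" implies. It is not Drill or F1 code: the function names (project_path, flatten_row), the sample record, and the choice to key each output column by its full dotted path are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def project_path(record: Dict[str, Any], path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if a step is missing."""
    value: Any = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def flatten_row(record: Dict[str, Any], paths: List[str]) -> Dict[str, Any]:
    """Build a flat row with one top-level column per requested nested path."""
    return {path: project_path(record, path) for path in paths}


# Hypothetical nested record, such as a JSON or Protobuf reader might produce.
row = {"customer": {"name": "Ann", "phone": {"cell": "555-0100", "home": None}}}
flat = flatten_row(row, ["customer.name", "customer.phone.cell"])
```

Arrays are left out of the sketch on purpose, since that is the part neither paper explains.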

Have you come across these details?

Thanks,
- Paul

 

    On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong <tongweijie178@gmail.com>
wrote:  
 
Google's latest paper about F1 [1] claims to support any data source by
using an extension API called TVF (see section 6.3). It also requires
declaring the column data types before the query.


[1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf

On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi All,
>
> We've discussed quite a few times whether Drill should or should not
> support or require schemas, and if so, how the user might express the
> schema.
>
> I came across a paper [1] that suggests a simple, elegant SQL extension:
>
> EXTRACT <column>[:<type>] {,<column>[:<type>]}
> FROM <stream_name>
>
> Paraphrasing into Drill's SQL:
>
> SELECT <column>[:<type>][AS <alias>] {,<column>[:<type>][AS <alias>]}
> FROM <table_name>
>
> Have a collection of JSON files in which string column `foo` appears in
> only half the files? Don't want to get schema conflicts with VARCHAR and
> nullable INT? Just do:
>
> SELECT name:VARCHAR, age:INT, foo:VARCHAR
> FROM `my-dir` ...
>
> Not only can the syntax be used to specify the "natural" type for a
> column, it might also specify a preferred type. For example, "age:INT" says
> that "age" is an INT, even though JSON would normally parse it as a BIGINT.
> Similarly, using this syntax is an easy way to tell Drill how to convert CSV
> columns from strings to DATE, INT, FLOAT, etc. without the need for CAST
> functions. (CAST functions read the data in one format, then convert it to
> another in a Project operator. Using a column type might let the reader do
> the conversion -- something that is easy to implement if using the "result
> set loader" mechanism.)
>
> Plus, the syntax fits nicely into the existing view file structure. If the
> types appear in views, then client tools can continue to use standard SQL
> without the type information.
>
> When this idea came up in the past, someone mentioned the issue of
> nullable vs. non-nullable. (Let's also include arrays, since Drill supports
> that.) Maybe add a suffix to the name:
>
> SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
>
> Not pretty, but works with the existing SQL syntax rules.
>
> Obviously, Drill has much on its plate, so I am not suggesting that Drill
> should do this soon. Just passing it along as yet another option to
> consider.
>
> Thanks,
> - Paul
>
> [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf
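
As a rough illustration of the quoted proposal, the Python sketch below parses a projection list written in the suggested name:TYPE form (with the NOT NULL, NULL, and [] suffixes) and then applies the declared types while reading a row of CSV-style strings, rather than casting afterwards. None of this is Drill code. The grammar, the ColumnSpec class, the converter table, the rule that a column with no suffix is nullable, and the rule that an empty string counts as a missing value are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, List, Optional

# Assumed grammar for one projection item: name[:TYPE[[]]] [NOT NULL | NULL]
_ITEM = re.compile(
    r"^\s*(?P<name>\w+)"
    r"(?::(?P<type>[A-Za-z]+)(?P<array>\[\])?)?"
    r"(?:\s+(?P<null>NOT\s+NULL|NULL))?\s*$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    type: Optional[str]  # None means no type was declared
    is_array: bool
    nullable: bool       # assumption: nullable unless NOT NULL is given


def parse_projection(text: str) -> List[ColumnSpec]:
    """Parse a comma-separated projection list (naive split; no AS aliases)."""
    specs = []
    for item in text.split(","):
        m = _ITEM.match(item)
        if m is None:
            raise ValueError(f"cannot parse projection item: {item!r}")
        null = m.group("null")
        specs.append(ColumnSpec(
            name=m.group("name"),
            type=m.group("type").upper() if m.group("type") else None,
            is_array=m.group("array") is not None,
            nullable=null is None or null.upper() == "NULL",
        ))
    return specs


# Assumed mapping from declared type to a string-to-value converter.
_CONVERTERS: Dict[str, Callable[[str], Any]] = {
    "VARCHAR": str,
    "INT": int,
    "BIGINT": int,
    "FLOAT": float,
    "DATE": date.fromisoformat,
}


def read_row(raw: Dict[str, str], specs: List[ColumnSpec]) -> Dict[str, Any]:
    """Convert raw string fields to their declared types at read time."""
    row: Dict[str, Any] = {}
    for spec in specs:
        if spec.is_array:
            raise NotImplementedError("array columns are outside this sketch")
        text = raw.get(spec.name)
        if text is None or text == "":
            if not spec.nullable:
                raise ValueError(f"column {spec.name!r} is NOT NULL but has no value")
            row[spec.name] = None
            continue
        # Undeclared or unknown types fall back to keeping the string.
        convert = _CONVERTERS.get(spec.type or "VARCHAR", str)
        row[spec.name] = convert(text)
    return row


specs = parse_projection("name:VARCHAR NOT NULL, age:INT, joined:DATE")
row = read_row({"name": "Ann", "age": "42", "joined": "2018-09-10"}, specs)
```

A column that is absent from the input, such as `foo` in the JSON example of the quoted message, comes back as None with its declared type still known from the spec, which is the property that would avoid the VARCHAR versus nullable INT conflict.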