drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <julianh...@gmail.com>
Subject Re: Apache Drill and Schema
Date Mon, 13 May 2013 20:03:34 GMT
Gunnar,

I've come across some interesting use cases putting Optiq on top of Drill and Splunk. They
help explain my philosophy for putting a strongly typed language (SQL) on top of systems that
are weakly typed. If one tried to remove strong typing from SQL, one would end up with a different
language, just as JavaScript is different from Java. But weakly typed systems clearly have
their advantages; the problem is how to surface those benefits without abandoning SQL.

(For what it's worth, I believe that this philosophy is consistent with Jacques' vision for
Drill.)

Optiq implements the SQL specification, and as such is strongly typed. Splunk and Drill are
weakly typed, but in different ways.

In Splunk, there are a few core system fields, but every record can have a different set of
fields. In a sense, the whole Splunk system is a single table with hundreds or thousands of
fields, the set of distinct fields across all records. And of course that set of fields may
change each time a record is added or removed.

The most convenient way to model a Splunk table was like this:

CREATE TABLE splunk (
  source VARCHAR NOT NULL,
  sourcetype VARCHAR NOT NULL,
  _extra VARCHAR);

There are two built-in fields "source" and "sourcetype", present in every record, and the
"_extra ANY" declaration tells Optiq that you can ask for any other field. Those extra fields
have type VARCHAR (in splunk all fields are strings) and their value in SQL will be NULL if
they don't exist in a particular row. Thus you can write

SELECT s."source", s."sourcetype", s."action", CAST(s."discount" AS INTEGER)
FROM "splunk"."splunk" AS s
WHERE s."action" = 'purchase'

To implement the Optiq SQL front-end to Drill, I added support for the JSON object model to
SQL by adding MAP and ARRAY type constructors and a variant type ANY. Thus Drill's "donuts"
example table appears in SQL as

CREATE TABLE donuts (
   _map MAP(VARCHAR, ANY) NOT NULL);

This is strongly typed, and therefore consistent with the SQL standard. Every row has a single
column called "_map", and its type is MAP(VARCHAR, ANY). (Equivalent to a Java field "final
Map<String, Object> map;".) If you want to access the fields of the "_map" column you
use the "[ ... ]" operator and casting.

To access the "ppu" field, you could write "SELECT CAST(_map['ppu'] AS INTEGER) as ppu FROM
donuts". I would like to add syntactic sugar so that you can write "SELECT ppu FROM donuts"
(implicitly inserting the "_map" field, just as Java inserts "this." before field references,
and converting "x.field" to "x['field']" for any field of type MAP).

If later you know the schema, you can create a view:

CREATE VIEW donuts_typed AS
  SELECT CAST(_map['name'] AS VARCHAR) AS name,
    CAST(_map['ppu'] AS FLOAT) AS ppu
FROM donuts

You can build your applications on top of such views at little or no loss of efficiency.

These illustrate ways that we can add strongly-typed standard SQL on top of a system with
late or no schema. It is still standard SQL, which everyone agrees is very important. The
sugarings don't damage the core language; if you don't like them, don't use them.

The biggest shortfall between the systems I've described above and what you seem to want is
enumerating what columns are available. (Your examples all contain 'select *'.) In strongly
typed SQL, if you want to know what columns are available, you ask the catalog (i.e. you use
metadata). In a weakly typed system, you issue a query (i.e. you use data). If you issue "select
*" against Optiq-for-Splunk or Optiq-for-Drill you will be disappointed. The column list will
expand to "source, sourcetype" and "_map" respectively, because that is the only metadata
known to the system before the query is run.

The easiest solution is to ask your users to specify explicit columns rather than writing
"select *". (Machine-generated queries, such as my own Mondrian engine, already tend to do
this.) Or, as I suggested earlier, go ahead and provide strongly typed views for your users
to use. Or, one could build a UI to assist writing queries: the UI could suggest which columns
to ask for, based on sketches of the data, previous query results, and columns that are already
used in the query.

Julian


Mime
View raw message