drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kunal Khatua" <ku...@apache.org>
Subject Re: Schema support in storage and format plugins
Date Mon, 03 Jun 2019 21:25:24 GMT
Hi Paul

Adding this to the user mailing list as well, to broaden the reach of this thread.

~ Kunal
On 5/27/2019 6:23:50 PM, Paul Rogers <par0328@yahoo.com.invalid> wrote:
Hi All,

Drill 1.16 introduced the the "provided schema" mechanism to help you query the kind of messy
files found in the real world. Arina and Bridget created nice documentation [1] for the feature.
Sorabh presented the feature at the recent Drill Meetup. If you are a plugin developer, we
need your help to expand the feature to other plugins.

To understand the need for the feature, it helps to remember that are two popular ways to
query data in a distributed file system (DFS) such as Hadoop: direct query or ETL.

Most major query engines require an ETL step: use Hive or Spark to transform your data into
a standard format such as Parquet or ORC. Then, use a tool such as Drill (or Impala, Presto,
Hive LLAP, Snowflake, Big Query, etc.) to query the data. The ETL approach works well, but
it has a cost: you must maintain multiple copies of the data, manage an ETL pipeline, and
so on. This cost is justified if your users query the data frequently, as in the classic "data
warehouse" use case.

There are other use cases (such as log analysis, data exploration, data science) where the
benefit of the two-step ETL process is less clear. These use cases are better served by directly
querying your "raw" data. Here your choices are mostly Drill, Spark or the original Hive.
Although Spark is very powerful, Drill is far easier to use for tasks that can be expressed
in SQL using your favorite BI tool.

Drill's "schema on read" (AKA "schemaless") approach allows Drill to read a data file directly:
just point Drill at a file and immediately run queries on that file. However, we've seen over
the years that files can be messy or ambiguous. Let's look at the two most common problems
and how the provided schema solves them: schema evolution and ambiguous data.

Schema evolution occurs when, say, a table started with two columns (a, b), then newer versions
added a third column (c, say). If you query SELECT a, b, c FROM ..., Drill has to guess a
type for column c in the old files (without the column). Drill generally guesses Nullable
Int. But, if the column is actually VarChar, then a schema conflict (AKA "hard schema change")
will occur and your query may fail. With a provided schema, you can tell Drill that column
"c" is a VarChar, and even provide a default value. Now, Drill knows what to do for files
without column "c".

Another kind of ambiguity occurs when Drill attempts to guess a data type from looking at
the first few rows of a file. The classic JSON example is a two-record file: {a: 10} {a: 10.1}
-- a column starts as an INT, but then we want to store FLOAT data into it, causing an error.
With a hint, the user can just declare the column as FLOAT, avoiding the ambiguity.

The "provided schema" feature solves these problems by supplying hints about how interpret
a file. The feature avoids heavy-weight cost of the Hive metastore (HMS) that is used by Hive,
Impala and Presto. Instead, the schema is a simple file stored directly in the DFS alongside
your data.

You can enable schema support in a plugin by using the new "enhanced vector framework" (EVF)
(AKA the "row set framework" or the "new scan framework".) This framework was originally developed
to control reader memory use by limiting batch and vector size and to minimize vector memory
fragmentation. Solving those problems turned out to also solve the problems needed to support
a provided schema.

We are actively working to prepare the EVF for your use. We are converting the Log (regex)
format plugin and preparing a tutorial based on that conversion. (The log reader was the basis
of the format plugin chapter of the Learning Apache Drill book, so it is a good choice for
the EVF tutorial.)

If you are a user, please try out the feature on text files and let us know how it works for
you. That way, we an address any issues before we convert the other plugins.

- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message