drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dobes Vandermeer" <dob...@gmail.com>
Subject Re: Thought about schemaless sources (mongodb/json)
Date Mon, 26 Oct 2020 18:57:00 GMT
Hi Paul,

I think you misunderstood my proposal.  I was suggesting that each whole row would be provided
as a single column of TEXT.

So if your JSON is like:

{"foo":2, "bar": 3}

The schema would initially be just a single column, maybe row TEXT NOT NULL.

Then when you operate on that data you would use projection functions, (e.g. similar to the
ones in PostgreSQL https://www.postgresql.org/docs/9.5/functions-json.html) like:

SELECT SUM(CASE WHEN JSON_TYPEOF(row, '$.bar') = 'number') THEN JSON_EXTRACT_PATH(row, '$.bar')

A similar approach could be taken for CSV data although you would want the column headings
to be available, which makes it a bit more complicated but I think a system could be designed
for that.

I don't think I would propose this as a the default way of handling JSON/CSV necessarily.
For people lucky enough to have data that adheres to a strict universal schema with no missing
fields / nulls the current approach is more convenient since you can use a better syntax.

For people with less perfect data this approach allows them to "clean up" data using a view
or a CREATE TABLE AS as parse the JSON / MongoDB data using custom logic to produce a parquet
file with a fully specified schema.

On 10/26/2020 10:54:00 AM, Paul Rogers <par0328@gmail.com> wrote:
Hi Dobes,

Thanks for the idea: a text-only approach would certainly solve one class
of problems. As it turns out, CSV does read all columns as text.

The solution does not solve two other problems which have popped up from
time to time. First, if file A contains columns a and b, and file B
contains columns a, b, and c, then the CSV reader for A does not know to
read column c. Some other operator in Drill will fill in the column, and
will choose Nullable INT as the type. That could, however, be changed to
choose Nullable Varchar.

Second, if a batch of A columns is given to the client before Drill sees
the first file B rows, then the client will see a schema of (a, b) followed
by a schema of (a, b, c.) JDBC and ODBC clients can't handle this. The REST
API handles this, but at the cost of buffering all rows in memory before
sending, which causes its own issues, as someone recently noted.

JSON does have an all-text mode which suffers from the same missing-column
issue. Since Mongo uses JDBC, it should also support all-text mode.

Mongo is a bit tricky because it allows an object syntax to specify
scalars: {type: "int", value: 10} (IIRC). Drill does not read such values
as strings, even in all-text mode. The result can be a type conversion
error if some values use the extended from, others use the simple form. So,
Mongo would need special handling. There was a start at fixing this before
our committer reviewers dried up. We could dust off the work and finish it

The all-text approach avoids type conversion issues because there is no
type conversion. Users, however, want to do math and other operations which
require numeric types. So, an inconvenience with the all-text approach is
that the user must include CASTs in every query to convert the data to the
proper type. Doing so makes queries more complex and slower.

You can work around the CAST problem by creating a view: the view contains
all the needed CASTs. The user references the view instead of the actual

Of course, if you are going to define a view, you might as well go all the
way and use the other approach you mentioned, which Drill started to
support: tell the CSV or JSON reader what schema to expect. That way, the
reader does the conversion (as for JSON), and does so reliably. Since the
schema is known at read time, all batches have the same schema, which
solves the "schema change" problem.

The key challenge is that, to solve these problems, Drill needs information
not available in the raw data. The question is: what is the most reliable,
least complex way to supply that information? All-text mode, with
pre-defined conversions, and a list of columns to expect would provide
Drill with the needed information.

One way to gather the information would be to extend to do a "sniff" pass
over the data to infer a schema across the set of files to scan, work out
any inconsistencies, then do the real scan with a reliable, complete schema.

Doing the "sniff" pass for every query is slow; it would be better to do
the "sniff" pass once and reuse the information. Drill tries to do this
with Parquet metadata. AWS recognized this same problem. The AWS Glue
product, based on Hive, will "sniff" your S3 data to infer a schema which
is then stored in HMS and available for tools to use. Drill could tie into
Glue to obtain the information so that the user need not create it.

So, several issues to consider and several ways to address them. It does
look like the key challenge is what you identified: to provide Drill with
information not available in that first file record.


- Paul

On Mon, Oct 26, 2020 at 1:42 AM Dobes Vandermeer wrote:

> Currently drill tries to infer schemas from data that doesn't come with
> one, such as JSON, CSV, and mongoDB. However this doesn't work well if the
> first N rows are missing values for fields - drill just assigns an
> arbitrary type to fields that are only null and no type to fields that are
> missing completely, then rejects values when it finds them later.
> What if you could instead query in a mode where each row is just given as
> a string, and you use JSON functions to load the data out and convert or
> cast it to the appropriate type?
> For JSON in particular it's common these days to provide functions that
> extract data from a JSON string column. BigQuery and postgres are two good
> examples.
> I think in many cases these JSON functions could be inspected by a driver
> and still be used for filter push
> down.
> Anyway, just an idea I had to approach the mongo schema problem that's a
> bit different from trying to specify the schema up front. I think this
> approach offers more flexibility to the user at the cost of more verbose
> syntax and harder to optimize queries.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message