drill-user mailing list archives

From Dobes Vandermeer <dob...@gmail.com>
Subject Thought about schemaless sources (mongodb/json)
Date Mon, 26 Oct 2020 08:42:25 GMT
Currently Drill tries to infer a schema from data that doesn't come with one, such as JSON,
CSV, and MongoDB.  However, this doesn't work well if the first N rows are missing values
for some fields: Drill assigns an arbitrary type to fields that are only null and no type
to fields that are missing completely, then rejects the real values when it finds them later.

What if you could instead query in a mode where each row is just given as a string, and you
use JSON functions to load the data out and convert or cast it to the appropriate type?
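
To make the idea concrete, here is a rough sketch in Python rather than Drill SQL.  The
rows, the field name and the helpers json_extract and cast_float are all made up for
illustration; they are not Drill functions.  The point is only that when every row is an
opaque string and the type is named at the point of use, null and missing fields in the
early rows no longer matter.

```python
import json

# Hypothetical rows from a schemaless source, one JSON string per row.
# The first two rows give no usable type for "price".
rows = [
    '{"id": 1, "price": null}',
    '{"id": 2}',
    '{"id": 3, "price": "19.99"}',
    '{"id": 4, "price": 5}',
]


def json_extract(raw, field):
    """Return a top-level field of a JSON string, or None if it is absent or null."""
    return json.loads(raw).get(field)


def cast_float(value):
    """Cast to float, passing None through the way SQL passes NULL through a cast."""
    return None if value is None else float(value)


# The query author states the type; nothing is inferred from the first N rows.
prices = [cast_float(json_extract(r, "price")) for r in rows]
print(prices)  # [None, None, 19.99, 5.0]
```

Rows 3 and 4 store the price as a string and as an integer respectively, and the explicit
cast handles both, which is the kind of mix that trips up up-front inference.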

For JSON in particular it's common these days to provide functions that extract data from
a JSON string column.  BigQuery and PostgreSQL are two good examples.

I think in many cases these JSON functions could be inspected by a driver and still be used
for filter pushdown.
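
As a sketch of what that inspection might look like (again in Python, with a made-up
to_mongo_filter helper and a made-up predicate shape, not anything that exists in Drill):
a driver that sees a comparison on an extracted field could turn it into a MongoDB filter
document and decline anything it doesn't recognize.

```python
# SQL comparison operators mapped to the corresponding MongoDB query operators.
MONGO_OPS = {
    "=": "$eq",
    "<>": "$ne",
    ">": "$gt",
    ">=": "$gte",
    "<": "$lt",
    "<=": "$lte",
}


def to_mongo_filter(field, op, value):
    """Translate `json_extract(row, field) <op> value` into a MongoDB filter.

    Returns None when the operator is not recognized, meaning the predicate
    stays in the query engine instead of being pushed down.
    """
    mongo_op = MONGO_OPS.get(op)
    if mongo_op is None:
        return None
    return {field: {mongo_op: value}}


print(to_mongo_filter("price", ">", 10))      # {'price': {'$gt': 10}}
print(to_mongo_filter("name", "LIKE", "a%"))  # None
```

One caveat with this sketch: MongoDB comparisons generally only match values of the same
type as the query value, so a price stored as the string "19.99" would not match the pushed
filter even though casting it in the query would.  A pushdown like this is only exact when
the cast doesn't change the stored type.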

Anyway, just an idea I had to approach the MongoDB schema problem that's a bit different from
trying to specify the schema up front.  I think this approach offers more flexibility to
the user at the cost of more verbose syntax and queries that are harder to optimize.
