drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dobes Vandermeer" <dob...@gmail.com>
Subject Re: Thought about schemaless sources (mongodb/json)
Date Mon, 26 Oct 2020 19:25:48 GMT
Hi Paul,

Good points all.  I'll consider experimenting with this if my orbit passes closer to drill
again in the future.

On 10/26/2020 12:12:39 PM, Paul Rogers <par0328@gmail.com> wrote:
Hi Dobes,

The good news is that, with the "new" JSON reader we added recently, it
would be pretty easy to read each row as text.

Most of what I said earlier still applies, however. Drill has a function to
parse JSON into fields. It suffers from the same limitations as the JSON
readers (becaues the code is identical.)

You could write code to pull out each column as is done in DBs. That way,
Drill knows the set of columns because they appear explicitly in the SELECT
clause. All good. This, however, introduces an extreme form of the CAST
issue: users must know not only how to CAST, but also how to parse JSON.

One drawback of the field-by-field way of extracting JSON is that doing so
is very inefficient: the JSON must be parsed for each column extraction.
Perhaps a clever optimizer could combine all such expressions into a single
combined operation that parses the JSON once. Of course, if we do that,
we're kind of back to what the JSON reader always does.

Further, if the JSON fields are inconsistent, you need to add conditional
code to handle cases such as the Mongo structured constants, or the case
where a column is both a text "10" and a number 20. This can be done, it
just makes each SQL statement more complex, which may or may not be a
problem. Again, a view could hide this complexity.

Still, simple enough to give the idea a try. 1) make that JSON reader
change using the new JSON reader. 2) Add a "parse JSON field" function as a
UDF. Java libraries exist to do the heavy lifting. You could then try out
the idea to see how it works.


- Paul

On Mon, Oct 26, 2020 at 11:57 AM Dobes Vandermeer wrote:

> Hi Paul,
> I think you misunderstood my proposal. I was suggesting that each whole
> row would be provided as a single column of TEXT.
> So if your JSON is like:
> {"foo":1}
> {"foo":2, "bar": 3}
> The schema would initially be just a single column, maybe row TEXT NOT
> Then when you operate on that data you would use projection functions,
> (e.g. similar to the ones in PostgreSQL
> https://www.postgresql.org/docs/9.5/functions-json.html) like:
> SELECT SUM(CASE WHEN JSON_TYPEOF(row, '$.bar') = 'number') THEN
> '$.foo') AS INTEGER END FROM ...
> A similar approach could be taken for CSV data although you would want the
> column headings to be available, which makes it a bit more complicated but
> I think a system could be designed for that.
> I don't think I would propose this as a the default way of handling
> JSON/CSV necessarily. For people lucky enough to have data that adheres to
> a strict universal schema with no missing fields / nulls the current
> approach is more convenient since you can use a better syntax.
> For people with less perfect data this approach allows them to "clean up"
> data using a view or a CREATE TABLE AS as parse the JSON / MongoDB data
> using custom logic to produce a parquet file with a fully specified schema.
> On 10/26/2020 10:54:00 AM, Paul Rogers wrote:
> Hi Dobes,
> Thanks for the idea: a text-only approach would certainly solve one class
> of problems. As it turns out, CSV does read all columns as text.
> The solution does not solve two other problems which have popped up from
> time to time. First, if file A contains columns a and b, and file B
> contains columns a, b, and c, then the CSV reader for A does not know to
> read column c. Some other operator in Drill will fill in the column, and
> will choose Nullable INT as the type. That could, however, be changed to
> choose Nullable Varchar.
> Second, if a batch of A columns is given to the client before Drill sees
> the first file B rows, then the client will see a schema of (a, b) followed
> by a schema of (a, b, c.) JDBC and ODBC clients can't handle this. The REST
> API handles this, but at the cost of buffering all rows in memory before
> sending, which causes its own issues, as someone recently noted.
> JSON does have an all-text mode which suffers from the same missing-column
> issue. Since Mongo uses JDBC, it should also support all-text mode.
> Mongo is a bit tricky because it allows an object syntax to specify
> scalars: {type: "int", value: 10} (IIRC). Drill does not read such values
> as strings, even in all-text mode. The result can be a type conversion
> error if some values use the extended from, others use the simple form. So,
> Mongo would need special handling. There was a start at fixing this before
> our committer reviewers dried up. We could dust off the work and finish it
> up.
> The all-text approach avoids type conversion issues because there is no
> type conversion. Users, however, want to do math and other operations which
> require numeric types. So, an inconvenience with the all-text approach is
> that the user must include CASTs in every query to convert the data to the
> proper type. Doing so makes queries more complex and slower.
> You can work around the CAST problem by creating a view: the view contains
> all the needed CASTs. The user references the view instead of the actual
> file.
> Of course, if you are going to define a view, you might as well go all the
> way and use the other approach you mentioned, which Drill started to
> support: tell the CSV or JSON reader what schema to expect. That way, the
> reader does the conversion (as for JSON), and does so reliably. Since the
> schema is known at read time, all batches have the same schema, which
> solves the "schema change" problem.
> The key challenge is that, to solve these problems, Drill needs information
> not available in the raw data. The question is: what is the most reliable,
> least complex way to supply that information? All-text mode, with
> pre-defined conversions, and a list of columns to expect would provide
> Drill with the needed information.
> One way to gather the information would be to extend to do a "sniff" pass
> over the data to infer a schema across the set of files to scan, work out
> any inconsistencies, then do the real scan with a reliable, complete
> schema.
> Doing the "sniff" pass for every query is slow; it would be better to do
> the "sniff" pass once and reuse the information. Drill tries to do this
> with Parquet metadata. AWS recognized this same problem. The AWS Glue
> product, based on Hive, will "sniff" your S3 data to infer a schema which
> is then stored in HMS and available for tools to use. Drill could tie into
> Glue to obtain the information so that the user need not create it.
> So, several issues to consider and several ways to address them. It does
> look like the key challenge is what you identified: to provide Drill with
> information not available in that first file record.
> Thanks,
> - Paul
> On Mon, Oct 26, 2020 at 1:42 AM Dobes Vandermeer wrote:
> > Currently drill tries to infer schemas from data that doesn't come with
> > one, such as JSON, CSV, and mongoDB. However this doesn't work well if
> the
> > first N rows are missing values for fields - drill just assigns an
> > arbitrary type to fields that are only null and no type to fields that
> are
> > missing completely, then rejects values when it finds them later.
> >
> > What if you could instead query in a mode where each row is just given as
> > a string, and you use JSON functions to load the data out and convert or
> > cast it to the appropriate type?
> >
> > For JSON in particular it's common these days to provide functions that
> > extract data from a JSON string column. BigQuery and postgres are two
> good
> > examples.
> >
> > I think in many cases these JSON functions could be inspected by a driver
> > and still be used for filter push
> > down.
> >
> > Anyway, just an idea I had to approach the mongo schema problem that's a
> > bit different from trying to specify the schema up front. I think this
> > approach offers more flexibility to the user at the cost of more verbose
> > syntax and harder to optimize queries.
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message