drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Wu <jiang...@mulesoft.com.INVALID>
Subject Re: About integration of drill and arrow
Date Thu, 16 Jan 2020 20:28:37 GMT
We experienced the same issue on first "null" value tripping up Drill's
schema on-demand logic that assumes certain data type for "null" causing
subsequent exception when a non-null is encountered.  We added logic inside
our storage plugin to make this work.  But good to hear there is a general
solution being worked on.  Logically, one can defer the type determination
until first non-null value.

Regarding "If Drill had a schema", I do fully understand the need for
schemas for purposes of vectorization.  Still having schema on demand has
been a major benefit for our use case.  So the tricky part is how to
achieve the best of both worlds.

-- Jiang


On Wed, Jan 15, 2020 at 7:09 PM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi Ted,
>
> Thanks much for the feedback! Good test cases indeed. The good news is
> that we're close to finishing a "V2" JSON reader that smooths over a few
> more JSON quirks like the "first column null" issue that can cause problems:
>
> {n: 1, v: null}{n: 2, v: "Gotcha! Wasn't Int, is actually VARCHAR!"}
>
> Glad your queries work. You gave an example that had fooled me multiple
> times:
>
> select nest.u from dfs.root.`/Users/tdunning/x.json`;
>
>
> The trick here is that Drill has no schema. All the parser can tell is,
> "hey, I've got a two-part name, `nest.u`. For me a two part name means
> schema.table or table.column, so, since `nest` isn't a schema, it must be a
> table.Oh, look, no such table exists. FAIL!" Using a three-part name works
> (IIRC):
>
> select t.nest.u from dfs.root.`/Users/tdunning/x.json` t;
>
>
> Now Drill sees that `t` is a table name, and works its way down from there.
>
>
> If Drill had a schema, then the planner could first check if `nest` is a
> schema, then if it is a table, then if it is a structured field in the
> query. Impala can do this because it has a schema; Drill can't. We can hope
> that, with the new schema work being added to Drill, that your query will
> "do the right thing" in the future.
>
> Adding `columns` to your query won't help: the `columns` name is valid in
> only one place: when working with CSV (or, more generally, delimited) data
> with no headers.
>
> This gets back to Jaing's point: we could really use better/more
> documentation. We're good at the bare basics, "such-and-so syntax exists",
> but we're not as good at explaining how to solve problems using Drill
> features. The Learning Apache Drill book tries to address some holes.
> Clearly, if you have a hard time with this, being part of the team that
> created Drill, we've got a bit of work to do! (To be honest, neither Impala
> nor Presto are much better in the "how to" department.)
>
>
> Additional use cases/frustrations are very welcome as you find them.
>
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, January 15, 2020, 3:44:09 PM PST, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>  On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
>
> > ...
> >
> > For example, Ted, you mention lack of nullability on structure members.
> > But, Drill represents structures as MAPs, and MAPs can have nullable
> > members. So, there is likely more to your request than the short summary
> > suggests. Perhaps you can help us understand this a bit more.
> >
>
> This was quite a while ago.
>
> I was reading JSON data with substructures of variable form.
>
> I think, however, that this impression is old news. I just tried it and it
> works the way I wanted.
>
> Here is my data:
>
> {"top":"a","nest":{"u":1, "v":"other"}}
> {"top":"b","nest":{"v":"this", "w":"that"}}
>
> And here are some queries that behave just the way that I wanted:
>
> apache drill> *select* * *from* dfs.root.`/Users/tdunning/x.json`;
>
> +-----+-------------------------+
>
> | *top* | *        nest          * |
>
> +-----+-------------------------+
>
> | a  | {"u":1,"v":"other"}    |
>
> | b  | {"v":"this","w":"that"} |
>
> +-----+-------------------------+
>
> 2 rows selected (0.079 seconds)
>
> apache drill> *select* nest *from* dfs.root.`/Users/tdunning/x.json`;
>
> +-------------------------+
>
> | *        nest          * |
>
> +-------------------------+
>
> | {"u":1,"v":"other"}    |
>
> | {"v":"this","w":"that"} |
>
> +-------------------------+
>
> 2 rows selected (0.114 seconds)
>
> apache drill> *select* nest.u *from* dfs.root.`/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table
> 'nest' not found
>
>
>
> [Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0)
>
> apache drill> *select* columns.nest.u *from* dfs.root.
> `/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table
> 'columns' not found
>
>
>
> [Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0)
>
> apache drill> *select* x.nest.u *from* dfs.root.`/Users/tdunning/x.json` x;
>
> +--------+
>
> | *EXPR$0* |
>
> +--------+
>
> | 1      |
>
> | null  |
>
> +--------+
>
> 2 rows selected (0.126 seconds)
> apache drill>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message