drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] paul-rogers commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
Date Sat, 14 Mar 2020 06:32:35 GMT
paul-rogers commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599018591
 
 
   Looks like a very cool feature. I've not been following the metadata implementation closely. Can you help get me up to speed by providing a bit more background information? What is the goal of this PR? Does it enable the format plugins to gather metadata if they choose, or does this PR actually add the metadata gathering itself?
   
   As I understand it, one of the things the metadata framework does is infer schema. Whether we infer schema for metadata or for a scan, we hit the same ambiguities. How does this code handle a schema conflict? Or do we just assume the schema is whatever we get in the sample?
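
   To make the question concrete, here is a minimal sketch of one possible widening rule a sampler might apply when two files disagree on a column's type. The class and the merge rule are my own illustration, not Drill's actual metadata API.

   ```java
   // Hypothetical sketch: one possible rule for resolving a type conflict
   // seen while sampling files for metadata. Not Drill's actual API.
   import java.util.HashMap;
   import java.util.Map;

   public class SchemaConflictSketch {
     enum SimpleType { INT, DOUBLE, VARCHAR }

     // Widen to a common type; fall back to VARCHAR when nothing else fits.
     static SimpleType merge(SimpleType a, SimpleType b) {
       if (a == b) return a;
       if ((a == SimpleType.INT && b == SimpleType.DOUBLE) ||
           (a == SimpleType.DOUBLE && b == SimpleType.INT)) {
         return SimpleType.DOUBLE;
       }
       return SimpleType.VARCHAR;
     }

     public static void main(String[] args) {
       // file1.csv sampled column "a" as INT; file2.csv sampled it as VARCHAR.
       Map<String, SimpleType> fromFile1 = new HashMap<>();
       fromFile1.put("a", SimpleType.INT);
       Map<String, SimpleType> fromFile2 = new HashMap<>();
       fromFile2.put("a", SimpleType.VARCHAR);

       SimpleType resolved = merge(fromFile1.get("a"), fromFile2.get("a"));
       System.out.println("Resolved type for column a: " + resolved); // VARCHAR
     }
   }
   ```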
   
   How do we gather stats? Do we have the reader read all the data and let a downstream operator make sense of it?
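
   For illustration, here is a toy per-column accumulator that either the reader or a downstream stats operator could feed one value at a time. The class and its accept method are hypothetical; a real implementation would use a streaming sketch such as HyperLogLog for NDV rather than an exact set.

   ```java
   // Hypothetical sketch: a per-column stats accumulator fed one value at a
   // time. Illustrative only, not Drill's actual stats API.
   import java.util.HashSet;
   import java.util.Set;

   public class ColumnStatsSketch {
     private long rowCount;
     private long nullCount;
     private long min = Long.MAX_VALUE;
     private long max = Long.MIN_VALUE;
     // A real implementation would use a sketch (e.g. HyperLogLog) for NDV.
     private final Set<Long> distinct = new HashSet<>();

     public void accept(Long value) {
       rowCount++;
       if (value == null) { nullCount++; return; }
       min = Math.min(min, value);
       max = Math.max(max, value);
       distinct.add(value);
     }

     @Override
     public String toString() {
       return String.format("rows=%d nulls=%d min=%d max=%d ndv=%d",
           rowCount, nullCount, min, max, distinct.size());
     }

     public static void main(String[] args) {
       ColumnStatsSketch stats = new ColumnStatsSketch();
       for (long v : new long[] {3, 7, 3, 42}) stats.accept(v);
       stats.accept(null);
       System.out.println(stats); // rows=5 nulls=1 min=3 max=42 ndv=3
     }
   }
   ```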
   
   For files that need a provided schema (CSV, say), are stats applied to the columns after type conversion, or are they gathered on the raw text values? That is, does this work use the provided schema if available? How does the provided schema relate to the metadata schema?
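
   A toy example of why the ordering matters: min/max over the raw text differ from min/max over the converted values. Both views below are illustrative only, not what this PR necessarily does.

   ```java
   // Hypothetical sketch: text-based vs. typed stats for a CSV column that a
   // provided schema declares as INT. Illustrative only.
   public class CsvTypedStatsSketch {
     public static void main(String[] args) {
       String[] rawColumn = {"9", "10", "100"};  // CSV delivers text

       // Raw-text view: lexicographic order, so "10" sorts before "9".
       String minText = rawColumn[0], maxText = rawColumn[0];
       for (String s : rawColumn) {
         if (s.compareTo(minText) < 0) minText = s;
         if (s.compareTo(maxText) > 0) maxText = s;
       }
       System.out.println("text min/max:  " + minText + " / " + maxText);  // 10 / 9

       // Typed view after conversion per the provided schema: numeric order.
       int minInt = Integer.MAX_VALUE, maxInt = Integer.MIN_VALUE;
       for (String s : rawColumn) {
         int v = Integer.parseInt(s);
         minInt = Math.min(minInt, v);
         maxInt = Math.max(maxInt, v);
       }
       System.out.println("typed min/max: " + minInt + " / " + maxInt);  // 9 / 100
     }
   }
   ```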
   
   What stats will we gather for non-Parquet files? How will we use them? It looks like there is code for partitions (I have not looked in depth, so I may be wrong). Are we using stats for partition pruning? If so, how does that differ from the existing practice of just walking the directory tree?
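
   For concreteness, here is a sketch of what stats-based pruning might look like, with stored min/max ranges standing in for the directory walk. FileStats and the pruning rule are hypothetical, not Drill's planner API.

   ```java
   // Hypothetical sketch: pruning files with stored min/max stats instead of
   // walking the directory tree. FileStats is illustrative, not Drill's API.
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   public class PruningSketch {

     static class FileStats {
       final String path;
       final long minOrderDate;
       final long maxOrderDate;

       FileStats(String path, long min, long max) {
         this.path = path;
         this.minOrderDate = min;
         this.maxOrderDate = max;
       }
     }

     // Keep only files whose [min, max] range can match the filter value.
     static List<String> pruneByStats(List<FileStats> files, long filterValue) {
       List<String> kept = new ArrayList<>();
       for (FileStats f : files) {
         if (filterValue >= f.minOrderDate && filterValue <= f.maxOrderDate) {
           kept.add(f.path);
         }
       }
       return kept;
     }

     public static void main(String[] args) {
       List<FileStats> files = Arrays.asList(
           new FileStats("/sales/2018/q1.csv", 20180101L, 20180331L),
           new FileStats("/sales/2018/q2.csv", 20180401L, 20180630L));
       // WHERE order_date = 20180415 keeps only q2.csv, no directory walk needed.
       System.out.println(pruneByStats(files, 20180415L)); // [/sales/2018/q2.csv]
     }
   }
   ```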
   
   I think that if I understand some of this background I'll be able to do a more complete review. Thanks!
   
   Just so we're on the same page, I'm working on a revision to how we handle schema. Basically, the EVF-based operators will fully integrate the provided schema, and will be ready for a "defined" schema created by the planner (as in a classic query engine, where the planner does all the schema calculations). The idea is to use a dynamic schema (what Drill has always done) when sampling the first row tells us all we need to know (as in Parquet), but to encourage a provided schema when sampling is not reliable (as in JSON).
   
   This means that we have a flow something like this:
   
   ```
   User --> Provided Schema --> Scan <-- Reader <-- Input Source Schema
                                  |
                                  v
                        Scan output schema
   ```
   The scan output schema describes the data a scan will deliver. Hopefully, this is also the schema used by stats gathering.
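
   A minimal sketch of the merge step the diagram implies, assuming provided types simply override the reader's discovered types. The maps and the merge rule are illustrative only, not Drill's actual schema machinery.

   ```java
   // Hypothetical sketch: building a scan output schema by letting provided
   // types override reader-discovered types. Illustrative only.
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class ScanOutputSchemaSketch {
     public static void main(String[] args) {
       // What the reader discovered from the input source (CSV: all text).
       Map<String, String> readerSchema = new LinkedHashMap<>();
       readerSchema.put("id", "VARCHAR");
       readerSchema.put("amount", "VARCHAR");

       // What the user provided up front.
       Map<String, String> providedSchema = new LinkedHashMap<>();
       providedSchema.put("amount", "DOUBLE");

       // Scan output schema: provided types win where both exist.
       Map<String, String> output = new LinkedHashMap<>(readerSchema);
       output.putAll(providedSchema);
       System.out.println(output); // {id=VARCHAR, amount=DOUBLE}
     }
   }
   ```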
   
   Do you see any potential conflicts between your metadata model and the provided-schema model above?
   

