drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
Date Sat, 14 Mar 2020 09:26:15 GMT
vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599032547
 
 
   @paul-rogers, this pull request enables the format plugin to gather metadata. Metadata
gathering logic was added in DRILL-7273.
   
   Regarding the schema: when metadata is being collected, the rules are the same as for regular
select queries. Drill either tries to infer the table schema or uses the user-provided schema.
   
   The metadata collection logic may become clearer after reading this section of the docs: https://github.com/apache/drill/blob/master/docs/dev/MetastoreAnalyze.md#analyze-operators-description
or this design doc: https://docs.google.com/document/d/14pSIzKqDltjLEEpEebwmKnsDPxyS_6jGrPOjXu6M_NM/edit?usp=sharing
   In short, yes: we use a reader that reads all the data, and downstream operators that transform
and store its statistics.
   
   > For files that need a provided schema (CSV, say), do we apply stats to the columns
after type conversion, or are stats gathered on the raw text values? That is, does this work
use the provided schema if available?
   
   Yes, we apply stats to the columns after schema conversion, so stats such as min/max have
correct values with respect to the natural ordering of the converted types.
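   
   To make the difference concrete, here is a small illustration in plain Python (not Drill code; the column values are made up). It only shows why min/max computed on raw text can differ from min/max computed after conversion to the schema type:

```python
# Illustration only: plain Python, not Drill code. The values are invented.
raw_values = ["9", "10", "100", "25"]  # a CSV column as read, before conversion

# Stats on the raw text follow lexicographic (string) ordering.
text_min, text_max = min(raw_values), max(raw_values)

# Stats after conversion to the schema type follow numeric ordering.
converted = [int(v) for v in raw_values]
int_min, int_max = min(converted), max(converted)

print(text_min, text_max)  # 10 9
print(int_min, int_max)    # 9 100
```

   With string ordering, "10" sorts before "9", so the raw-text min/max would be misleading for an integer column.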
   
   > How does the provided schema relate to the metadata schema?
   
   After the provided schema is applied in the scan, Drill uses the resolved schema for the columns
and stores it in the Metastore.
   
   > What stats will we gather for non-Parquet files? How will we use them? Looks like
there is code for partitions (have not looked in depth, so I may be wrong). Are we using stats
for partition pruning? If so, how does that differ from the existing practice of just walking
the directory tree?
   
   We collect exactly the same stats for non-Parquet files as for Parquet. We can use them in the
same way they are used for Parquet: to prune files when a filter on specific columns is specified,
and to prune unneeded files for limit queries. Directory pruning still works the same way as it
did before these changes (it also works for Parquet).
   I think some tests in `TestMetastoreWithEasyFormatPlugin` will help show which
optimizations were added.
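   
   As a rough sketch of the two file-level optimizations mentioned above, written in plain Python rather than taken from Drill: `FileStats`, `prune_for_range_filter`, and `prune_for_limit` are hypothetical names, and the per-file stats are invented. The real planner logic is more involved.

```python
from dataclasses import dataclass


# Illustration only: hypothetical per-file stats, not the Drill Metastore model.
@dataclass
class FileStats:
    path: str
    row_count: int
    col_min: int  # min of the filtered column within this file
    col_max: int  # max of the filtered column within this file


def prune_for_range_filter(files, low, high):
    """Keep files whose [col_min, col_max] range can overlap low <= col <= high."""
    return [f for f in files if f.col_max >= low and f.col_min <= high]


def prune_for_limit(files, limit):
    """Keep the shortest prefix of files whose row counts cover the limit.

    Only valid for a query without a filter, where every row qualifies.
    """
    kept, rows = [], 0
    for f in files:
        if rows >= limit:
            break
        kept.append(f)
        rows += f.row_count
    return kept


files = [
    FileStats("a.csv", 100, 1, 50),
    FileStats("b.csv", 200, 51, 120),
    FileStats("c.csv", 50, 121, 300),
]
filtered = [f.path for f in prune_for_range_filter(files, 60, 100)]
limited = [f.path for f in prune_for_limit(files, 150)]
print(filtered)  # ['b.csv']
print(limited)   # ['a.csv', 'b.csv']
```

   In both cases the decision is made from stored statistics alone, so the pruned files are never opened.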
   
   > Do you see any potential conflicts between your metadata model and the above provided
schema model?
   
   It looks like there shouldn't be any conflicts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
