drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement metadata usage for all format plugins
Date Sat, 14 Mar 2020 22:22:15 GMT
paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement metadata usage
for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392623969

 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java
 @@ -124,13 +127,16 @@ public EasyGroupScan(
       // use file system metadata provider without specified schema and statistics
       metadataProviderManager = new FileSystemMetadataProviderManager();
-    SimpleFileTableMetadataProviderBuilder builder =
-        (SimpleFileTableMetadataProviderBuilder) metadataProviderManager.builder(
-            MetadataProviderManager.MetadataProviderKind.SCHEMA_STATS_ONLY);
+    DrillFileSystem fs =
+        ImpersonationUtil.createFileSystem(ImpersonationUtil.resolveUserName(userName), formatPlugin.getFsConf());
-    this.metadataProvider = builder.withLocation(selection.getSelectionRoot())
+    this.metadataProvider = tableMetadataProviderBuilder(metadataProviderManager)
+        .withSelection(selection)
+        .withFileSystem(fs)
+    this.usedMetastore = metadataProviderManager.usesMetastore();
     initFromSelection(selection, formatPlugin);
+    checkMetadataConsistency(selection, formatPlugin.getFsConf());
 Review comment:
   This has been nagging at me. For Parquet, metadata includes both partition information
and information about the insides of files (row groups, etc.). But for files other than Parquet,
there is no useful information in metadata about file contents. As a result, all of the benefit
of metadata is to assist with partition pruning. Metadata avoids the need to walk the directory tree.
   However, in order to ensure that the metadata is consistent we... walk the directory tree.
   So, for files other than Parquet, are we gaining anything (other than more complexity)
by using metadata if we must check the tree on each query?
   There *is* a gain if we can trust the metadata and avoid the walk of the directory tree.
(See comments elsewhere which no longer appear in this code view since my comments overlapped
with your next round of changes.)
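
   To make the trade-off concrete, here is a minimal, self-contained sketch. It is not Drill
code: the class `MetadataWalkSketch`, its methods, and the partition-to-files map are all
hypothetical stand-ins for illustration. It only assumes that metadata for non-Parquet files
boils down to a list of files per partition, and it counts directory-tree walks to show that
a per-query consistency check costs the same walk that metadata was supposed to save.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names exist in Drill.
public class MetadataWalkSketch {

  // Counts full directory-tree walks so the cost of each strategy is visible.
  static int treeWalks = 0;

  // Lists every regular file under root: the expensive step metadata is meant to avoid.
  static Set<Path> walkTree(Path root) {
    treeWalks++;
    try (Stream<Path> paths = Files.walk(root)) {
      return paths.filter(Files::isRegularFile)
          .collect(Collectors.toCollection(TreeSet::new));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Partition pruning from cached metadata alone: no file system access.
  static List<Path> pruneFromMetadata(Map<String, List<Path>> filesByPartition,
                                      String partition) {
    return filesByPartition.getOrDefault(partition, List.of());
  }

  // Consistency check: compares cached metadata with the tree, which needs a full walk.
  static boolean isConsistent(Map<String, List<Path>> filesByPartition, Path root) {
    Set<Path> cached = filesByPartition.values().stream()
        .flatMap(List::stream)
        .collect(Collectors.toCollection(TreeSet::new));
    return cached.equals(walkTree(root));
  }

  public static void main(String[] args) throws IOException {
    // Build a tiny two-partition table in a temporary directory.
    Path root = Files.createTempDirectory("table");
    Path p2019 = Files.createDirectory(root.resolve("2019"));
    Path p2020 = Files.createDirectory(root.resolve("2020"));
    Path a = Files.createFile(p2019.resolve("a.csv"));
    Path b = Files.createFile(p2020.resolve("b.csv"));
    Map<String, List<Path>> metadata = Map.of("2019", List.of(a), "2020", List.of(b));

    // Trusting the metadata: pruning needs no walk at all.
    System.out.println(pruneFromMetadata(metadata, "2020") + " walks=" + treeWalks);
    // Verifying the metadata first: one full walk, the same cost as having no metadata.
    System.out.println(isConsistent(metadata, root) + " walks=" + treeWalks);
  }
}
```

   In this sketch, pruning from trusted metadata leaves the walk counter untouched, while any
call to `isConsistent` increments it, which is the cost the comment above is asking about.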

