drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore
Date Tue, 04 Feb 2020 03:12:30 GMT
paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r374454183
 
 

 ##########
 File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 ##########
 @@ -0,0 +1,69 @@
+---
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+
+### Iceberg Tables Location
+
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component
+will be located in `/drill/metastore/iceberg/tables` folder.
+
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table identifier in Drill.
+
+Assume Iceberg table location is `/drill/metastore/iceberg/tables`, metadata for the table
+`dfs.tmp.nation` will be stored in the `/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.
+
+Example of base Metastore configuration file `drill-metastore-override.conf`, where Iceberg tables will be stored in
+ hdfs:
+
+```
+drill.metastore.iceberg: {
+  config.properties: {
+    fs.defaultFS: "hdfs:///"
+  }
+
+  location: {
+    base_path: "/drill/metastore",
+    relative_path: "iceberg"
+  }
+}
+```
+
+### Metadata Storage Format
+
+Iceberg tables support data storage in three formats: Parquet, Avro, ORC. Drill metadata will be stored in Parquet files.
+This format was chosen over others since it is column oriented and efficient in terms of disk I/O when specific
+columns need to be queried.
+
+Each Parquet file will hold information for one partition. Partition keys will depend on Metastore
+component characteristics. For example, for tables component, partitions keys are storage plugin, workspace,
+table name and metadata key.
+
+Parquet files name will be based on UUID to ensure uniqueness. If somehow collision occurs, modify operation
+in Metastore will fail.
 
 Review comment:
   Good info, but unclear. First, please explain what is meant by the Parquet file. Iceberg is a file system within a file, right? So, the user can never see the Parquet files? If so, then this section is moot: the user can't do anything with the information.
   
   However, if Iceberg provides zip-like utilities to inspect the Iceberg file, then we can tell the user how to use them. Then we can explain what they will see.
   
   I did not follow the file format. There must be a file for the table itself, right? That has schema, etc?
   
   Then, there is a file for each partition? What is a "metadata key"? Is this the concatenated directory names? If I have "mytable/2016/12/01/files.parquet", will my partition key be "2016/12/01"? If so, partition keys *must* be unique: the file system demands it. However, if the key is "20161201", then the name can be ambiguous, but this is self-inflicted.
   
   Then, where does the UUID fit in? Do we have a table from partition directory keys to UUIDs?
   
   Since the user has no control, and relies on us to make things work, the sentence about failure can be removed.
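
For reference, the layout the doc describes could be sketched roughly as below. This is only an illustration of the text in the diff, not Drill's actual code: the helper names `table_metadata_location` and `new_parquet_file_name` are hypothetical, splitting the table identifier on `.` is an assumption, and the `.parquet` suffix is assumed (the doc only says the name is based on a UUID).

```python
import posixpath
import uuid


def table_metadata_location(iceberg_table_location, table_identifier):
    """Build the metadata folder for a Drill table from its storage keys.

    Hypothetical helper: assumes the identifier has exactly the form
    "<storage plugin>.<workspace>.<table name>".
    """
    plugin, workspace, table = table_identifier.split(".", 2)
    return posixpath.join(iceberg_table_location, plugin, workspace, table)


def new_parquet_file_name():
    """Return a UUID-based file name (the suffix is an assumption)."""
    return f"{uuid.uuid4()}.parquet"


# Example taken from the doc text: table dfs.tmp.nation under the tables component.
location = table_metadata_location("/drill/metastore/iceberg/tables", "dfs.tmp.nation")
print(location)
print(posixpath.join(location, new_parquet_file_name()))
```

If this matches the intent, the doc could state the mapping from storage keys to folders explicitly, and say whether the metadata key also appears in the path.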

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
