Subject [GitHub] [drill] paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore
Date Tue, 04 Feb 2020 03:12:30 GMT
paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r374454183

 File path: _docs/performance-tuning/drill-metastore/030-drill-iceberg-metastore.md
 @@ -0,0 +1,69 @@
+title: "Drill Iceberg Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+Drill uses Iceberg Metastore implementation based on [Iceberg tables](http://iceberg.incubator.apache.org). For Drill 1.17,
+ this is default Drill Metastore implementation. For details on how to configure Iceberg Metastore implementation and
+ its option descriptions, please refer to [Iceberg Metastore docs](https://github.com/apache/drill/blob/master/metastore/iceberg-metastore/README.md).
+{% include startnote.html %}
+Iceberg table supports concurrent writes and transactions but they are only effective on file systems that support
+ atomic rename.
+If the file system does not support atomic rename, it could lead to inconsistencies during concurrent writes.
+{% include endnote.html %}
+### Iceberg Tables Location
+Iceberg tables will reside on the file system in the location based on
+Iceberg Metastore base location `drill.metastore.iceberg.location.base_path` and component specific location.
+If Iceberg Metastore base location is `/drill/metastore/iceberg`
+and tables component location is `tables`. Iceberg table for tables component
+will be located in `/drill/metastore/iceberg/tables` folder.
+Metastore metadata will be stored inside Iceberg table location provided
+in the configuration file. Drill table metadata location will be constructed
+based on specific component storage keys. For example, for `tables` component,
+storage keys are storage plugin, workspace and table name: unique table identifier in Drill.
+Assume Iceberg table location is `/drill/metastore/iceberg/tables`, metadata for the table
+`dfs.tmp.nation` will be stored in the `/drill/metastore/iceberg/tables/dfs/tmp/nation` folder.
+Example of base Metastore configuration file `drill-metastore-override.conf`, where Iceberg tables will be stored in
+ hdfs:
+drill.metastore.iceberg: {
+  config.properties: {
+    fs.defaultFS: "hdfs:///"
+  }
+  location: {
+    base_path: "/drill/metastore",
+    relative_path: "iceberg"
+  }
+}
+### Metadata Storage Format
+Iceberg tables support data storage in three formats: Parquet, Avro, ORC. Drill metadata will be stored in Parquet files.
+This format was chosen over others since it is column oriented and efficient in terms of disk I/O when specific
+columns need to be queried.
+Each Parquet file will hold information for one partition. Partition keys will depend on
+component characteristics. For example, for tables component, partitions keys are storage plugin, workspace,
+table name and metadata key.
+Parquet files name will be based on UUID to ensure uniqueness. If somehow collision occurs, modify operation
+in Metastore will fail.
 Review comment:
   Good info, but unclear. First, please explain what is meant by the Parquet file. Iceberg is a file system within a file, right? So, the user can never see the Parquet files? If so, then this section is moot: the user can't do anything with the information.
   However, if Iceberg provides zip-like utilities to inspect the Iceberg file, then we can tell the user how to use them. Then we can explain what they will see.
   I did not follow the file format. There must be a file for the table itself, right? That has schema, etc?
   Then, there is a file for each partition? What is a "metadata key"? Is this the concatenated directory names? If I have "mytable/2016/12/01/files.parquet", will my partition key be "2016/12/01"? If so, partition keys *must* be unique: the file system demands it. However, if the key is "20161201", then the name can be ambiguous, but this is self-inflicted.
   Then, where does the UUID fit in? Do we have a table from partition directory keys to UUIDs?
   Since the user has no control, and relies on us to make things work, the sentence about failure can be removed.
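   To check my reading of the location rules in the quoted text, here is a minimal Python sketch. The function names are hypothetical and are not Drill or Iceberg APIs; the sketch only joins the path segments the doc describes (base path, relative path, component location, then the storage keys) and assumes nothing about which files end up inside the resulting folder.

```python
import posixpath


def iceberg_table_location(base_path, relative_path, component_location):
    """Join the Metastore base location with a component location.

    Hypothetical helper: mirrors the doc's description only,
    not Drill's actual implementation.
    """
    return posixpath.join(base_path, relative_path, component_location)


def table_metadata_location(table_location, storage_plugin, workspace, table_name):
    """Append the `tables` component storage keys to the Iceberg table location."""
    return posixpath.join(table_location, storage_plugin, workspace, table_name)


# Values taken from the doc's own example.
tables_location = iceberg_table_location("/drill/metastore", "iceberg", "tables")
nation_location = table_metadata_location(tables_location, "dfs", "tmp", "nation")
print(tables_location)
print(nation_location)
```

   If this matches the intent, the doc could state the rule this directly: one folder per table, keyed by storage plugin, workspace and table name.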

