trafodion-codereview mailing list archives

From DaveBirdsall <...@git.apache.org>
Subject [GitHub] incubator-trafodion pull request: First phase of website update.
Date Wed, 18 Nov 2015 18:04:16 GMT
Github user DaveBirdsall commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/179#discussion_r45235615
  
    --- Diff: docs/src/site/markdown/architecture-overview.md ---
    @@ -0,0 +1,154 @@
    +## Overview ##
    +
    +Trafodion provides an operational SQL engine on top of Hadoop -- a solution targeted
toward operational workloads in the Hadoop Big Data environment. Included are:
    +
    +* Fully functional ANSI SQL language support
    +* Full ACID support for read/write queries including distributed transaction protection
for multiple rows, tables and statements
    +* Heterogeneous storage engine access including native access to data stores
    +* Enhanced High Availability support for client applications
    +* Support for large data sets using optimized intra-query parallelism
    +* Performance improvements for OLTP workloads via compile and runtime optimizations
    +
    +Transaction management features include:
    +
    +* Transaction serializability using the HBase-trx implementation of Multi-Version Concurrency Control
    +* Transaction recovery to achieve database consistency
    +* Thread-aware transaction management support to work with multi-threaded SQL clients
    +* Non-transactional/direct access to HBase tables
    +
    +## Process Architecture ##
    +
    +The following figure depicts the Trafodion process architecture:
    +
    +![Trafodion Process Architecture](images/process-architecture.png "Trafodion Process
Architecture")
    +
    +The figure above should be interpreted as follows:
    +
    +* Client Applications talk to Trafodion via a JDBC or ODBC interface (see the connection sketch after this list). The Trafodion drivers implement these interfaces, using an optimized Trafodion-specific wire protocol to talk to the Master Executor process in the SQL layer. The diagram shows a JDBC Type-4 driver configuration.
    +* The Master Executor is the root process for executing SQL statements submitted via
JDBC or ODBC. It contains a copy of the SQL compiler code. Most SQL statements are compiled
within this process. The root of any compiled query plan is also executed in the Master Executor.
    +* A few SQL statements (for example, DDL and some utilities) require a second instance
of the compiler code; this is the CMP process in the diagram.
    +* Trafodion supports several forms of execution-time parallelism. When a query plan requires parallelism, a set of ESPs (Executor Server Processes) is dynamically spawned (if not already available). Each ESP executes a fragment of the query plan.
    +* The DTM (Distributed Transaction Management) process manages distributed transactions.
This includes log management and transaction coordination.
    +* The Storage Engine layer consists of HBase and Hadoop processes. Trafodion allows SQL
access to native HBase tables. Trafodion reads HBase metadata in order to process these tables.
Trafodion also offers its own implementation of SQL table, stored as an HBase table, for applications
that need a more efficient OLTP representation. Trafodion generates its own metadata for such
tables, and stores that in HBase.
    +
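    +For illustration, a minimal client connection through the ODBC API might look like the following sketch; the data source name `Trafodion` and the credentials are placeholders, not fixed names:
    +
    +```cpp
    +#include <sql.h>
    +#include <sqlext.h>
    +
    +int main() {
    +  SQLHENV env;
    +  SQLHDBC dbc;
    +
    +  // Allocate an environment handle and request ODBC 3.x behavior.
    +  SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    +  SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
    +
    +  // Connect via the driver manager; the Trafodion driver then speaks
    +  // its wire protocol to a Master Executor (MXOSRVR) process.
    +  SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
    +  SQLRETURN rc = SQLDriverConnect(dbc, NULL,
    +      (SQLCHAR *)"DSN=Trafodion;UID=user;PWD=secret;", SQL_NTS,
    +      NULL, 0, NULL, SQL_DRIVER_NOPROMPT);
    +
    +  if (SQL_SUCCEEDED(rc))
    +    SQLDisconnect(dbc);
    +  SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    +  SQLFreeHandle(SQL_HANDLE_ENV, env);
    +  return 0;
    +}
    +```
    +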
    +## Connectivity ##
    +
    +The Database Connectivity Services (DCS) framework enables applications developed for ODBC/JDBC APIs to access a Trafodion SQL database server. DCS is a distributed service. It uses the underlying HBase ZooKeeper instance for its definition of a cluster. [Apache ZooKeeper](http://zookeeper.apache.org/ "ZooKeeper website") is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All participating nodes and clients need to be able to access the running ZooKeeper ensemble.
    +
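    +As a sketch of that discovery pattern using the ZooKeeper C client (the znode path `/trafodion/dcs` below is a hypothetical placeholder, not the layout DCS actually uses):
    +
    +```cpp
    +#include <zookeeper/zookeeper.h>
    +#include <cstdio>
    +
    +int main() {
    +  // Connect to the same ZooKeeper ensemble the HBase cluster uses.
    +  zhandle_t *zh = zookeeper_init("zkhost:2181", NULL, 30000, NULL, NULL, 0);
    +  if (!zh) return 1;
    +
    +  // List the children of a (hypothetical) znode where DCS processes
    +  // would register themselves.
    +  struct String_vector kids;
    +  if (zoo_get_children(zh, "/trafodion/dcs", 0, &kids) == ZOK) {
    +    for (int i = 0; i < kids.count; i++)
    +      printf("registered: %s\n", kids.data[i]);
    +    deallocate_String_vector(&kids);
    +  }
    +  zookeeper_close(zh);
    +  return 0;
    +}
    +```
    +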
    +DCS is a collection of components:
    +
    +* **ODBC/JDBC Drivers**: Provide a standard programming-language middleware API for accessing database management systems (DBMS).
    +* **DCS Master Process**: The DCS Master server is responsible for monitoring all server instances in the cluster. It assigns an ODBC/JDBC client connection request to a Master Executor (MXOSRVR) process. It also has a backup process that takes over the DCS Master role during failures.
    +* **DCS Server Process**: This process is responsible for starting a Master Executor (MXOSRVR) server process and keeping it running. There is one DCS Server process per node in the cluster.
    +* **Master Executor Process**: This is the database server that provides database access to ODBC/JDBC clients. There is a one-to-one relationship between an ODBC/JDBC client connection and a database server process. The Master Executor performs all SQL operations on behalf of its client, making the SQL calls required to execute a query through the Executor and access HBase tables. The Master Executor is often referred to as MXOSRVR.
    +
    +## Transaction Subsystem ##
    +
    +Trafodion supports distributed ACID transaction semantics using the Multi-Version Concurrency Control (MVCC) model. Transaction management is built on top of a fork of the *HBase-trx* project, with the following changes:
    +
    +* Upgraded it to work on HBase version 0.98.1 (for CDH 5.1) or 0.98.0 (for HDP 2.1).
    +* Added support for parallel worker processes doing work on behalf of the same transaction.
    +* Added support for global transactions, that is, transactions that can encompass resources
(regions/HTables) across an HBase cluster.
    +* Added transaction recovery after server failure.
    +
    +There is one Distributed Transaction Manager (DTM) process per node in a cluster running Trafodion. The DTM process owns and keeps track of all transactions that were started on that node. (In HBase-trx, transactions were tracked in the library code of each client, which meant that after a server failure, there was no way to restart the transaction manager for in-doubt transactions.)
    +
    +When a Trafodion client begins a SQL statement, it checks in with the Transaction Manager
(TM) to begin the transaction. The TM returns a cluster-unique transaction ID. This transaction
ID in turn is propagated by the Trafodion Executor to any processes that work on some fragment
of that SQL statement. This transaction ID propagation occurs courtesy of a Trafodion messaging
layer, which keeps track, for example, of whether a process death has occurred.
    +
    +When a Trafodion Executor process issues an HBase call, the modified client-side HBase-trx
library can deduce which TM owns the transaction from the transaction ID, and registers itself
with that TM if it has not already done so. Thus, at any given moment in time, a TM is aware
of what processes are participating in a transaction.
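    +
    +To make that deduction concrete, here is a minimal sketch of one possible ID encoding; the actual DTM layout is not documented here, so the bit split is purely an assumption:
    +
    +```cpp
    +#include <cstdint>
    +
    +// Assumed encoding: owning node in the high 16 bits, a node-local
    +// sequence number in the low 48 bits.
    +uint64_t beginTransaction(uint16_t nodeId, uint64_t &localSeq) {
    +  return (static_cast<uint64_t>(nodeId) << 48) |
    +         (++localSeq & 0xFFFFFFFFFFFFULL);
    +}
    +
    +// Any participant can recover the owning TM from the ID alone, which
    +// is what lets the HBase-trx client library register itself with it.
    +uint16_t owningNode(uint64_t txId) {
    +  return static_cast<uint16_t>(txId >> 48);
    +}
    +```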
    +
    +The original HBase-trx library worked by extending certain Java classes in the region server. Our implementation has, for the most part, been changed to execute this library in co-processors, which allows better extensibility at the HBase level: with a class-extension approach, only one feature could extend the HBase code, whereas with co-processors it is possible to host several extensions. Endpoint and observer co-processors perform the resource manager role in transaction processing.
    +
    +For additional details, please refer to the [Trafodion Distributed Transaction Management](presentations/dtm-architecture.pdf) presentation.
    +
    +## Compiler Architecture ##
    +
    +The Trafodion Compiler translates SQL statements into query plans that can then be executed
by the Trafodion execution engine, commonly called the Executor.
    +
    +The Compiler is a multi-pass compiler. Each pass transforms a representation of the SQL
statement into a new or augmented representation which is input to the next pass. The sections
below give more detail on each pass. The logic that calls each pass is in the CmpMain class,
method CmpMain::compile. You can find that logic in file $MY_SQROOT/sql/sqlcomp/CmpMain.cpp.
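    +
    +In outline, the driving logic resembles the following sketch; the pass names and signatures here are illustrative stand-ins, not the actual CmpMain code:
    +
    +```cpp
    +class RelExpr;                   // statement representation (see Parser)
    +class QueryPlan;                 // executable plan for the Executor
    +class SketchDiagsArea {          // standing in for ComDiagsArea
    + public:
    +  bool hasErrors() const { return errors > 0; }
    +  int errors = 0;
    +};
    +
    +// Hypothetical pass entry points:
    +RelExpr   *parseSql(const char *sql, SketchDiagsArea &diags);
    +RelExpr   *bindTree(RelExpr *t, SketchDiagsArea &diags);
    +RelExpr   *normalizeTree(RelExpr *t, SketchDiagsArea &diags);
    +RelExpr   *optimizeTree(RelExpr *t, SketchDiagsArea &diags);
    +RelExpr   *preCodeGen(RelExpr *t, SketchDiagsArea &diags);
    +QueryPlan *generatePlan(RelExpr *t, SketchDiagsArea &diags);
    +
    +// Each pass consumes the previous pass's representation; the caller
    +// checks for errors before moving on (see Error Management below).
    +QueryPlan *compileStatement(const char *sql, SketchDiagsArea &diags) {
    +  RelExpr *t = parseSql(sql, diags);
    +  if (diags.hasErrors()) return nullptr;
    +  t = bindTree(t, diags);
    +  if (diags.hasErrors()) return nullptr;
    +  t = normalizeTree(t, diags);
    +  if (diags.hasErrors()) return nullptr;
    +  t = optimizeTree(t, diags);
    +  if (diags.hasErrors()) return nullptr;
    +  t = preCodeGen(t, diags);
    +  if (diags.hasErrors()) return nullptr;
    +  return generatePlan(t, diags);
    +}
    +```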
    +
    +A copy of the compiler code runs in the Master process, which avoids inter-process message
passing between the Compiler and Executor. At the moment the compiler code is not re-entrant,
but it is a serially reusable resource within the Master. Some processing is recursive. For
example, the execution logic for DDL statements is packaged with the compiler code. When we
execute a DDL statement, the Executor spawns a separate Compiler process to execute that logic.
For another example, the UPDATE STATISTICS utility dynamically generates SQL SELECT statements
to obtain statistical data. Since we are not re-entrant, we spawn a separate Compiler process
for this recursive processing.
    +
    +The compiler is written in C++.
    +
    +### Parser ###
    +
    +The parser pass performs lexical and syntactic analysis, transforming the SQL statement into a parse tree. Trafodion uses a hand-coded scanner for lexical analysis of UCS2 strings. (UTF-8 encoding for SQL statement text is supported but is translated to UCS2 internally.)
    +
    +The parser grammar is implemented as a set of Bison rules. The output of the parser is
a tree of objects of class RelExpr, representing relational operators. Scalar expressions
are represented by trees of ItemExpr objects, which are contained in the nodes of the RelExpr
tree. This common model to represent a query is used throughout the compilation process.
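    +
    +In outline, the two hierarchies relate roughly like this (a sketch, not the actual class definitions):
    +
    +```cpp
    +#include <vector>
    +
    +// Scalar expressions: trees of ItemExpr nodes (constants, column
    +// references, function calls, predicates, and so on).
    +class ItemExpr {
    + public:
    +  std::vector<ItemExpr *> children;
    +  virtual ~ItemExpr() {}
    +};
    +
    +// Relational operators: trees of RelExpr nodes (scans, joins, group
    +// bys, and so on), whose nodes may carry scalar expression trees.
    +class RelExpr {
    + public:
    +  std::vector<RelExpr *> children;        // relational inputs
    +  ItemExpr *selectionPredicate = nullptr; // an attached scalar tree
    +  virtual ~RelExpr() {}
    +};
    +```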
    +
    +### Binder ###
    +
    +The binder pass takes the parse tree and decorates it with metadata information. All references to SQL objects (tables, views, columns and so on) are bound to their respective metadata. The binder also performs type synthesis. At this stage, errors such as a wrong data type being passed to a function call, or a column reference that doesn't belong to any of the tables in scope, are detected.
    +
    +The binder also manages a cache of query plans. If the binder detects that the new SQL
statement is similar to one previously compiled, it simply reuses the earlier query plan (modifying
parameters as needed), bypassing subsequent compiler passes. This can be significant as optimization
is often the most expensive compilation phase.
    +
    +### Normalizer ###
    +
    +The SQL language is rich in redundancy. Many concepts can be expressed in multiple ways. For example, sub-queries can be expressed as joins. The DISTINCT clause can be transformed into GROUP BY. The normalizer pass removes this redundancy, transforming the parse tree into a normalized representation in the following steps:
    +
    +* **Predicate Pull-Up**: Predicates are pulled up the tree as high as is semantically
possible. Then equivalence classes (Value Equivalence Groups or VEGs) are created for columns
and values that are subject to equality predicates. References to such columns and values
are then replaced with a reference to the equivalence class (VEG). Similarly equality predicates
themselves are replaced with a predicate that simply points to the equivalence class. Predicate
pull-up is how we achieve transitive closure.
    +* **Normalization**: Predicates are pushed back down again, performing some optimizations. For example, given the query `select * from t1 join t2 on t1.a = t2.b where t1.a = 5`, we can infer the predicate `t2.b = 5` and push it down into the t2 scan operator (see the sketch after this list).
    +* **Semantic Query Optimization**: We perform unconditional transformations that depend
on uniqueness or cardinality constraints.
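    +
    +A minimal sketch of the VEG idea follows, under the simplifying assumption that members are just named values; the real VEG machinery is considerably richer:
    +
    +```cpp
    +#include <set>
    +#include <string>
    +#include <vector>
    +
    +// A Value Equivalence Group: every member is known equal to the others.
    +struct VEG {
    +  std::set<std::string> members;  // e.g. { "t1.a", "t2.b", "5" }
    +};
    +
    +// The predicates a VEG implies for one table's scan: each of that
    +// table's columns in the group must equal the constant.
    +std::vector<std::string> impliedPredicates(const VEG &veg,
    +                                           const std::string &tablePrefix,
    +                                           const std::string &constant) {
    +  std::vector<std::string> preds;
    +  for (const std::string &m : veg.members)
    +    if (m.compare(0, tablePrefix.size(), tablePrefix) == 0)
    +      preds.push_back(m + " = " + constant);  // e.g. "t2.b = 5"
    +  return preds;
    +}
    +```
    +
    +Given the group { t1.a, t2.b, 5 } and the prefix "t2.", this yields t2.b = 5, the predicate pushed into the t2 scan in the example above.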
    +
    +### Optimizer ###
    +
    +The Trafodion optimizer is a rule-based, cost-driven optimizer based on the Cascades
Framework. By "rule-based", we mean that plan transformation is based on a set of rules coded
within the Optimizer. (We don't mean syntax-driven optimization based on hints in the SQL
statement text.) By "cost-driven", we mean that cost estimates are used to bound the search
space.
    +
    +It is a top-down optimizer; that is, it generates an initial feasible plan for the query, then, using rules, transforms that plan into semantically equivalent alternatives. The optimizer computes the cost of each plan, and uses these costs to bound its search for additional plans using a branch-and-bound algorithm. This is in contrast to classical, dynamic programming-style optimizers, which build up a set of plans "bottom-up", by first considering all one-table plans, then joins of two tables, then joins of three tables and so on.
    +
    +The optimizer itself makes multiple passes, some of which can be bypassed depending on the optimization level chosen for the compile. The first pass simply generates the initial feasible plan. Subsequent passes apply successively richer sets of rules to traverse the search space. For example, we first consider only hash joins, and in later passes introduce the possibility of nested or merge joins.
    +
    +The optimizer makes a distinction between logical and physical expressions. A logical
expression considers the semantics of an operator, for example, a join. Certain aspects of
a plan pertain to logical expressions, for example estimated output row count. A physical
expression considers the implementation of an operator, for example, a nested join. Certain
aspects of a plan pertain to physical expressions, for example estimated message counts. Rules
transform logical expressions into other logical expressions or into physical expressions.
So, for example, join order would be permuted at the logical expression level, then join method
considered as we implement the operator with a physical expression.
    +
    +Search spaces in general are exponential in size, so the optimizer is rich in heuristics to limit where it searches. The optimizer also takes variation into account: cost estimates for individual relational operators will be imperfect, and the optimizer tries to pick plans that degrade gracefully if those estimates are off the mark.
    +
    +Another factor the optimizer takes into account is that traversal can wrap back to a
previously visited plan. The optimizer remembers plans previously visited in a "memo" structure
(class CascadesMemo). Plans are hashed for quick lookup.
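    +
    +A sketch of that memo idea, with hypothetical structure names (CascadesMemo itself is considerably more elaborate):
    +
    +```cpp
    +#include <cstddef>
    +#include <unordered_map>
    +#include <vector>
    +
    +struct Expr;  // stand-in for a plan expression
    +
    +// Each group collects semantically equivalent expressions.
    +struct Group {
    +  std::vector<Expr *> exprs;
    +};
    +
    +// Expressions are hashed so a traversal that wraps back to a
    +// previously visited plan finds it instead of exploring it again.
    +struct Memo {
    +  std::unordered_map<std::size_t, Group> groups;
    +
    +  bool alreadyVisited(std::size_t exprHash) const {
    +    return groups.find(exprHash) != groups.end();
    +  }
    +  void remember(std::size_t exprHash, Expr *e) {
    +    groups[exprHash].exprs.push_back(e);
    +  }
    +};
    +```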
    +
    +### Pre-Code Generator ###
    +
    +The pre-code generator performs unconditional transformations after the optimization
phase. References to elements of an equivalence class are replaced with the most efficient
alternatives. 
    +
    +For example, an equivalence class (VEG) containing { T1.A, T2.B, 5 } results, in the context of a T2 scan operator, in the predicate T2.B = 5.
    +
    +### Generator ###
    +
    +The generator pass transforms the chosen optimized tree into a query plan which can then
be executed by the Executor. Low-level optimizations of scalar expressions take place here.
Many scalar expressions are generated in native machine code using the open source LLVM infrastructure.
For those scalar operators where we have not yet implemented native expression support, we
instead generate code that is interpreted at run time.
    +
    +### Heap Management ###
    +
    +To make heap management efficient, the Compiler uses heap classes (NAHeap) that it shares with the Executor. One heap, the Statement heap, is used for objects that are particular to a given SQL statement's compilation, for example, parse tree nodes. At the end of statement compilation, we simply destroy the heap instead of calling "delete" on each of possibly thousands of objects. Another heap, the Context heap, is used for objects that may be reused across SQL statements; for example, metadata is cached within the compiler. As one can imagine, considerable care goes into selecting which heap to use when creating a given object, to avoid dangling references and other resource leaks. For example, access to a given file must be encapsulated in an object on the global (Context) heap, since on the Statement heap we cannot count on execution of the destructor to close the file.
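    +
    +The pattern is essentially that of an arena allocator. A minimal sketch (the real NAHeap is far more sophisticated):
    +
    +```cpp
    +#include <cstddef>
    +#include <vector>
    +
    +// Arena-style heap: individual objects are never freed one by one;
    +// destroying the heap releases everything allocated from it at once.
    +class SketchHeap {
    + public:
    +  void *allocate(std::size_t bytes) {
    +    blocks_.push_back(new char[bytes]);
    +    return blocks_.back();
    +  }
    +  ~SketchHeap() {                        // end of statement compilation:
    +    for (char *b : blocks_) delete[] b;  // one sweep frees it all
    +  }
    + private:
    +  std::vector<char *> blocks_;
    +};
    +```
    +
    +Parse tree nodes would be allocated from a heap like this that lives for one statement compilation; note that, as in the file example above, destructors of individual objects are never run.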
    +
    +### Error Management ###
    +
    +The Compiler captures error information into a ComDiagsArea object. The style of programming
is to return on errors rather than throw exceptions. Calling logic then checks for the presence
of errors before continuing. So, for example, the main logic that invokes each compiler pass
checks for errors before proceeding to the next pass.
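    +
    +A small sketch of the style, with illustrative names rather than the real ComDiagsArea interface:
    +
    +```cpp
    +#include <string>
    +#include <vector>
    +
    +// Accumulates errors as compilation proceeds.
    +class SketchDiagsArea {
    + public:
    +  void addError(const std::string &msg) { errors_.push_back(msg); }
    +  bool hasErrors() const { return !errors_.empty(); }
    + private:
    +  std::vector<std::string> errors_;
    +};
    +
    +struct ColumnInfo;  // stand-ins for binder structures
    +struct Scope {
    +  ColumnInfo *lookup(const std::string &) { return nullptr; }  // empty scope
    +};
    +
    +// Instead of throwing, the routine records the error and returns a
    +// failure indicator; the caller checks before continuing.
    +ColumnInfo *bindColumn(const std::string &name, Scope &scope,
    +                       SketchDiagsArea &diags) {
    +  ColumnInfo *col = scope.lookup(name);
    +  if (col == nullptr)
    +    diags.addError("column " + name + " is not in any table in scope");
    +  return col;
    +}
    +```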
    +
    +## Executor Architecture ##
    +
    +The Trafodion Executor implements a data-flow architecture. That is, each relational
operator is implemented as a set of tasks which are scheduled for execution. Operators communicate
with each other using queues.
    +
    +### Relational Operators ###
    +
    +A query plan consists of a collection of fragments, each fragment being a portion of
the query plan executed in a given process. Each fragment in turn is a tree of relational
operators. A relational operator may in turn be decorated with additional scalar expressions.
Relational operators in the query plan are represented by two class hierarchies, ex_tdb and
ex_tcb. The ex_tdb (tdb = "task descriptor block") hierarchy contains the compiler-generated
state for the operator. The ex_tcb (tcb = "task control block") hierarchy contains the run-time
state for the operator. So, for example, the queue objects are pointed to by ex_tcb objects.
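    +
    +A sketch of the tdb/tcb split and the queue linkage (hypothetical shapes; the real classes carry far more state):
    +
    +```cpp
    +#include <queue>
    +
    +struct Row;  // stand-in for a data row
    +
    +// Compile-time state: built by the generator, read-only at run time.
    +struct sketch_tdb {
    +  int numChildren;
    +  // ... operator-specific parameters chosen by the compiler ...
    +};
    +
    +// Run-time state: the queues live here, one pair per parent/child link.
    +struct sketch_tcb {
    +  const sketch_tdb *tdb;        // its compiler-generated descriptor
    +  std::queue<Row *> downQueue;  // requests flowing down to this operator
    +  std::queue<Row *> upQueue;    // result rows flowing back up
    +
    +  // A task: perform a bounded unit of work, then yield to the
    +  // scheduler; returning true means more work is pending.
    +  bool work();
    +};
    +```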
    +
    +### Scalar Expressions ###
    +
    +Scalar expressions are evaluated by an expression evaluator. If the expression could
be compiled into native machine code, the expression evaluator simply invokes this code. Otherwise,
the expression evaluator implements an interpreter. For historical reasons, there are actually
two interpreters. The first (and oldest) is a high level clause-based expression evaluator:
each clause roughly corresponds to a scalar operator in the original SQL text. The second
(and newest) is a PCODE-based evaluator, implementing a lower-level machine-like instruction
set. Most expressions that cannot be generated as native machine code are generated as PCODE;
those few expressions that PCODE cannot cover are generated as clause expressions. For debugging
purposes, it is possible to force the Compiler to generate PCODE instead of native machine
code, or clause-based expressions instead of either native machine code or PCODE.
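    +
    +To make the lower-level flavor concrete, here is a toy evaluator for a machine-like instruction set; the opcodes are invented for illustration and are not actual PCODE:
    +
    +```cpp
    +#include <cstdint>
    +#include <vector>
    +
    +enum class Op : std::uint8_t { PUSH_CONST, ADD, MUL, RET };
    +
    +struct Instr {
    +  Op op;
    +  std::int64_t operand;  // used by PUSH_CONST
    +};
    +
    +// Interpreter loop: dispatch on opcode over a small value stack,
    +// in the style of a PCODE evaluator.
    +std::int64_t evaluate(const std::vector<Instr> &code) {
    +  std::vector<std::int64_t> stack;
    +  for (const Instr &i : code) {
    +    switch (i.op) {
    +      case Op::PUSH_CONST: stack.push_back(i.operand); break;
    +      case Op::ADD: { std::int64_t b = stack.back(); stack.pop_back();
    +                      stack.back() += b; break; }
    +      case Op::MUL: { std::int64_t b = stack.back(); stack.pop_back();
    +                      stack.back() *= b; break; }
    +      case Op::RET: return stack.back();
    +    }
    +  }
    +  return stack.empty() ? 0 : stack.back();
    +}
    +```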
    +
    +### Interprocess Communication ###
    +
    +An IPC layer, shared with other components like the compiler, abstracts the (asynchronous)
communication of objects across process boundaries. The sorts of things that flow are query
plan objects, data rows, and error objects (ComDiagsArea).
    --- End diff --
    
    "shared with other components like the compiler"... consider "such as" instead of "like"


