trafodion-codereview mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DaveBirdsall <...@git.apache.org>
Subject [GitHub] incubator-trafodion pull request #811: [TRAFODION-2322] Improve UPDATE STATS...
Date Wed, 02 Nov 2016 18:24:14 GMT
GitHub user DaveBirdsall opened a pull request:

    https://github.com/apache/incubator-trafodion/pull/811

    [TRAFODION-2322] Improve UPDATE STATS performance on long char columns

    This set of code changes addresses two issues:
    
    1.	UPDATE STATISTICS performance on long character columns can be very poor. A bit of
explanation: With string columns in Hive tables in particular, Trafodion by default maps these
to a VARCHAR(31999) data type. For large tables, UPDATE STATISTICS uses a SQL query to compute
groupings of values. This query can be very slow for large VARCHARs. Often the Hive string
columns contain just short strings, so this slow performance can be annoying. But the problem
isn’t just with Hive; a Trafodion table with a VARCHAR(31999) column that has just short
strings in the column suffers from the same malady.
    
    2.	Creating a persistent sample table on very long character columns (> 200000 bytes)
would fail. A bit of explanation: Trafodion supports longer strings on Hive tables than it
does for its own Trafodion tables. UPDATE STATISTICS sometimes creates sample tables. Today
the sample tables are always Trafodion tables, even though the original table is a Hive table.
JIRA TRAFODION-2251 fixed this problem for sample tables created on-the-fly. However for persistent
sample tables created using the UPDATE STATISTICS CREATE SAMPLE syntax, that fix did not work.
As it turns out, in order to fix the first issue above, I ended up reworking the JIRA TRAFODION-2251
fix, and in the process fixed this additional issue as well.
    
    Design notes:
    
    When UPDATE STATISTICS needs to process a long character column, it now truncates the
value to (by default) 256 bytes. This can enormously increase the performance of underlying
SQL queries used by UPDATE STATISTICS on such columns. The trade-off is that unique entry
count (UEC) may be underestimated. In particular, in the worst case, in a data set where all
strings begin with the same first 256 bytes, the UEC will be mis-estimated as one! The ‘256’
value is soft, and can be changed using CQD USTAT_MAX_CHAR_COL_LENGTH_IN_BYTES.
    
    As part of this processing, UPDATE STATISTICS also may create sample tables. When creating
these tables, UPDATE STATISTICS now truncates long character columns to the same 256 byte
limit.
    
    In JIRA Trafodion-2251, we employed a similar technique for sample tables, but it did
not work in the UPDATE STATISTICS CREATE SAMPLE code path. That has been fixed (ustat/hs_yacc.y).
    
    Also, the technique used in JIRA Trafodion-2251 was to use CREATE TABLE AS SELECT in order
to get the shorter columns. This is problematic for use with Trafodion tables, as partitioning
is not preserved. That hurts the performance of sample table population and querying. Extending
this to allow specifying partitioning appeared to be hard (because in principle, the SELECT
can be *any* query). So instead, we now use a different technique. We have extended CREATE
TABLE LIKE to take a new LIMIT COLUMN LENGTH TO n clause. When present, CREATE TABLE LIKE
will limit any character column to n bytes. One nuance to this is that if a truncated column
would form part of the primary key, the primary key constraint is replaced by a clustering
key (STORE BY). We must do this since truncating values may affect their uniqueness properties.
    
    A few other small changes are present. The debug routines in NAMemory.h have been slightly
changed and made public to simply debugging buffer overrun issues. Also, a clear() method
has been added to NAString because I got tired of trying to figure out the NAString equivalent
of the STL string clear method. So, NAString becomes just a tiny bit more like STL. Regression
test compGeneral/TEST023 was enhanced to exercise code paths where long character columns
are truncated. Test seabase/TEST040 was extended to include testing of CREATE TABLE LIKE LIMIT
COLUMN LENGTH.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2322a

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-trafodion/pull/811.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #811
    
----
commit dfaaa41b044980ddf62c74b8e8778b29e3f267eb
Author: Dave Birdsall <dbirdsall@apache.org>
Date:   2016-11-02T18:20:20Z

    [TRAFODION-2322] Improve UPDATE STATS performance on long char columns

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message