trafodion-codereview mailing list archives

From DaveBirdsall <...@git.apache.org>
Subject [GitHub] trafodion pull request #1730: [TRAFODION-3223] Don't scale down for non-Puts...
Date Wed, 17 Oct 2018 22:13:13 GMT

GitHub user DaveBirdsall opened a pull request:

    https://github.com/apache/trafodion/pull/1730

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

    The estimateRowCount code in HBaseClient.java tried to scale down row counts by the proportion
of non-Put cells in the file. That is, it was trying to estimate row count from cell count,
in part by discounting the effect of Delete tombstone cells. It was doing this on the basis
of a sample of 500 rows in one HFile.
    
    We find, however, that with time-ordered data that is aged out, the Delete cells are not
uniformly distributed but instead tend to clump in one place. If we are unlucky and get an
HFile that begins with 500 Delete tombstones, we will incorrectly assume most of the table
consists of deleted rows and drastically underestimate the number of rows.
    
    A drastic underestimate can be very bad; it is much better to overestimate. So the
code that attempted to scale down the row count based on the number of non-Put cells has been
removed. Also, if the number of Puts in our sample is very small (fewer than 50), we now
ignore the sample and use the total number of entries instead.
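
As a rough illustration of the revised behavior, here is a minimal sketch. It is not the code from HBaseClient.java: the class name, method name, and parameters are hypothetical, and it assumes the sample is used to estimate Put cells per row, which this message does not spell out. Only the threshold of 50 Puts and the fallback to the total entry count come from the description above.

```java
public class RowCountEstimateSketch {
    // From the pull request description: with fewer Puts than this, the sample is ignored.
    static final long MIN_PUTS_IN_SAMPLE = 50;

    /**
     * Hypothetical estimator.
     * totalEntries:    total number of cells (entries) reported for the table's HFiles
     * sampledRows:     number of rows read in the sample (500 in the description)
     * sampledPutCells: number of Put cells seen in that sample
     */
    static long estimateRowCount(long totalEntries, long sampledRows, long sampledPutCells) {
        // Too few Puts (for example, a sample that is mostly Delete tombstones):
        // do not trust the sample; fall back to the total entry count, which overestimates.
        if (sampledPutCells < MIN_PUTS_IN_SAMPLE || sampledRows <= 0) {
            return totalEntries;
        }
        // Assumption: rows are estimated as cells divided by Put cells per row.
        // The ratio is clamped to at least 1 so the estimate never exceeds totalEntries.
        double putCellsPerRow = Math.max(1.0, (double) sampledPutCells / sampledRows);
        // Note: there is no scaling down for the proportion of non-Put cells.
        return Math.round(totalEntries / putCellsPerRow);
    }

    public static void main(String[] args) {
        // Healthy sample: 500 rows with 2000 Put cells, so 4 cells per row.
        System.out.println(estimateRowCount(1_000_000L, 500L, 2000L));
        // Sample dominated by tombstones: only 10 Puts, so the sample is ignored.
        System.out.println(estimateRowCount(1_000_000L, 500L, 10L));
    }
}
```

The point of the design is that an unlucky sample can now only push the estimate up toward the total entry count, never down toward zero.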
    
    The changes described above are in HBaseClient.java.
    
    There are two other small, unrelated changes in this pull request as well:
    
    1. The regression test filter for filtering out SYSKEYS has been changed. The current
minimum number of decimal digits in a SYSKEY is 15; the filter was assuming they were at least
16 digits. This would lead to regression failures if someone was very unlucky and got just
the wrong Linux thread ID for their process.
    
    2. An uninitialized member of class ExRtFragTable is now initialized. This is a long-standing
bug; the changes for pull request https://github.com/apache/trafodion/pull/1724 made it observable.
For random parallel queries, the Executor GUI might come up at run time if the uninitialized
value happened to be non-zero.
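
To make the SYSKEY filter issue in item 1 concrete, here is a small sketch. The actual regression test filter is not shown in this message and is not written in Java; the class name, patterns, and placeholder text below are assumptions used only to show how a 16-digit minimum misses a 15-digit value.

```java
public class SyskeyFilterSketch {
    // Hypothetical patterns: the old filter assumed at least 16 digits,
    // while the current minimum SYSKEY length is 15 digits.
    static final String OLD_PATTERN = "\\b\\d{16,}\\b";
    static final String NEW_PATTERN = "\\b\\d{15,}\\b";

    // Replace every run of digits matched by the pattern with a fixed placeholder,
    // so that expected and actual test output compare equal.
    static String mask(String line, String pattern) {
        return line.replaceAll(pattern, "<SYSKEY>");
    }

    public static void main(String[] args) {
        String line = "SYSKEY 123456789012345 A"; // a 15-digit value
        System.out.println(mask(line, OLD_PATTERN)); // left unmasked by the old pattern
        System.out.println(mask(line, NEW_PATTERN)); // masked by the new pattern
    }
}
```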

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DaveBirdsall/trafodion Trafodion3223

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1730
    
----
commit 898812f84c510ab8798d5af6e3e63559f4078a07
Author: Dave Birdsall <dbirdsall@...>
Date:   2018-10-17T22:06:44Z

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

----

