Julian Hyde created CALCITE1616:

Summary: Data profiler
Key: CALCITE1616
URL: https://issues.apache.org/jira/browse/CALCITE1616
Project: Calcite
Issue Type: Bug
Reporter: Julian Hyde
Assignee: Julian Hyde
Profiling looks at a data set and infers characteristics and constraints about the data.
Some applications:
* it helps users understand their data,
* inferred constraints may allow additional optimizations (e.g. a foreign key allows semijoin
removal),
* column statistics help the optimizer estimate the selectivity of filters and joins,
* joint cardinalities drive the algorithm that chooses which tiles of a lattice to materialize.
Here is some sample output:
{noformat}
{"type": "distribution", "columns": [], "cardinality": 1000000)},
{"type": "distribution", "columns": ["gender"], "cardinality": 2, nullCount: 0, values: ["F",
"M"])},
{"type": "distribution", "columns": ["state"], "cardinality": 50)},
{"type": "distribution", "columns": ["zipcode"], "cardinality": 43000, "nullCount": 0)},
{"type": "distribution", "columns": ["state", "zipcode"], "cardinality": 43000)},
{"type": "unique", "columns": ["orderId"]},
{"type": "fd", "columns": ["productId"], "depend": ["brand", "productName"]},
{noformat}
Note:
* the cardinality of 0 columns is the count of the data set;
* "nullCount" and "values" are only present for distributions of 1 column;
* "nullCount" may be is omitted if 0;
* "values" is only present if there are N or fewer values
* "distribution" of 2 or more columns is only output if it is "interesting"; in the case of
["state", "zipcode"] it is interesting because the joint cardinality 43,400 is fewer than
the cardinality 999,982 that would be expected if they were uniform and independent;
* "fd" and "unique" are minimal. For example, we don't output unique(orderId, productId) if
we have unique(orderId).
Other ideas:
* Some measure of skewedness. Does one value occur many more times than others?
* Don't compute joint distributions for the power set of columns. This requires memory exponential
in the number of columns. In pass 1 compute singlecolumn distributions. In pass N compute
Ncolumn distributions only for the combinations of columns that the previous pass indicates
will be interesting.
* Use HyperLogLog to compute cardinalities.
* Add lowcardinality columns to joint distributions. Rather than computing cardinality(zipcode,
state) compute cardinality(zipcode, state, gender="F") and cardinality(zipcode, state, gender="M").
Because HLL rolls up losslessly, with 2x the memory you can compute 3 results: cardinality(zipcode,
state), cardinality(zipcode, gender), cardinality(state, gender).
* Approximate histograms: approximate quartiles? Or buckets with exact counts in each range?
* Allow passing previous results into the algorithm. If you know the previous histogram of
the orderDate column it is easier to compute its new histogram than if you start from scratch.
* HyperLogLog is inaccurate for low cardinalities, so keep all values until the number of
values exceeds a threshold, then transition to buckets.

This message was sent by Atlassian JIRA
(v6.3.15#6346)
