[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhenhua Wang updated SPARK-17074:

Description:
We support two kinds of histograms:
 - Equi-width histogram: Each column interval in the histogram has a fixed width. The height
of an interval represents the frequency of the column values that fall in that interval, so
the height varies from interval to interval. We use the equi-width histogram when the number
of distinct values is less than 254.
 - Equi-height histogram: The width of each column interval varies, while the heights of all
column intervals are the same. The equi-height histogram is effective in handling skewed data
distributions. We use the equi-height histogram when the number of distinct values is equal
to or greater than 254.
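
To make the two kinds concrete, here is a minimal Python sketch. It is not the Spark
implementation: the function names, the num_bins parameter, and the way endpoints are picked
are assumptions for illustration. Only the 254 threshold comes from the description above.
It assumes a non-empty list of numeric values.

```python
NDV_THRESHOLD = 254  # threshold on the number of distinct values, from the description


def choose_histogram_kind(values):
    """Pick the histogram kind from the number of distinct values (ndv)."""
    ndv = len(set(values))
    return "equi-width" if ndv < NDV_THRESHOLD else "equi-height"


def equi_width_histogram(values, num_bins):
    """Fixed-width intervals; the height (row count) of each interval varies."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value is clamped into the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    endpoints = [lo + i * width for i in range(num_bins + 1)]
    return endpoints, counts


def equi_height_endpoints(values, num_bins):
    """Variable-width intervals; each interval holds about the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    # Upper endpoint of interval i is the value at the i-th quantile position.
    uppers = [ordered[max((i * n) // num_bins - 1, 0)] for i in range(1, num_bins + 1)]
    return [ordered[0]] + uppers


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100]
print(equi_width_histogram(skewed, 2))   # ([1.0, 50.5, 100.0], [9, 1])
print(equi_height_endpoints(skewed, 2))  # [1, 5, 100]
```

On the skewed sample, the equi-width histogram puts nine of the ten rows into one interval,
while the equi-height endpoints split the rows five and five. This is why the equi-height
form handles skew better.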
We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms (for both numeric
and string types) or the endpoints of equi-height histograms (for numeric types only). Then, if
we get the endpoints of an equi-height histogram, we need to compute the numbers of distinct
values (ndv's) between those endpoints using [SPARK-17997] to form the equi-height histogram.
This Jira incorporates the three Jiras mentioned above to support the needed aggregation
functions. We need to resolve them before this one.
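
The second step, counting distinct values between the endpoints of an equi-height histogram,
could look roughly like the sketch below. The function name and the interval convention
(each interval excludes its lower endpoint and includes its upper endpoint, except the first,
which includes both) are assumptions for illustration, not the behavior of the aggregation
functions in the referenced Jiras.

```python
import bisect


def ndv_per_interval(values, endpoints):
    """Count distinct values in each interval defined by consecutive endpoints.

    Assumed convention: interval i is (endpoints[i], endpoints[i + 1]],
    except the first interval, which also includes its lower endpoint.
    """
    distinct = sorted(set(values))
    ndvs = []
    for i in range(len(endpoints) - 1):
        lo, hi = endpoints[i], endpoints[i + 1]
        if i == 0:
            start = bisect.bisect_left(distinct, lo)   # include the lower endpoint
        else:
            start = bisect.bisect_right(distinct, lo)  # exclude the lower endpoint
        end = bisect.bisect_right(distinct, hi)        # include the upper endpoint
        ndvs.append(end - start)
    return ndvs


values = [1, 2, 2, 3, 4, 5, 6, 7, 8, 50, 100, 100]
print(ndv_per_interval(values, [1, 5, 100]))  # [5, 5]
```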
was:
We support two kinds of histograms:
 - Equi-width histogram: Each column interval in the histogram has a fixed width. The height
of an interval represents the frequency of the column values that fall in that interval, so
the height varies from interval to interval. We use the equi-width histogram when the number
of distinct values is less than 254.
 - Equi-height histogram: The width of each column interval varies, while the heights of all
column intervals are the same. The equi-height histogram is effective in handling skewed data
distributions. We use the equi-height histogram when the number of distinct values is equal
to or greater than 254.
> generate histogram information for column
> 
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Subtask
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu

