[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570366#comment-15570366 ]
Zhenhua Wang edited comment on SPARK-17074 at 10/22/16 12:52 PM:

Well, I've been stuck here for a few days. I went through the QuantileSummaries paper and our
code in Spark, and I still don't have any clue how to implement the second method and get
its bounds.
So I've decided to adopt the first method for now, so that it won't block our progress on the CBO
work. We can implement the other one in the future.
A PR for a new aggregate function that counts NDVs (numbers of distinct values) over multiple
intervals has already been sent.
was (Author: zenwzh):
Well, I've been stuck here for a few days. I went through the QuantileSummaries paper and our
code in Spark, and I still don't have any clue how to implement the second method and get
its bounds.
So I've decided to adopt the first method for now, so that it won't block our progress on the CBO
work. We can implement the other one in the future.
A PR for a new aggregate function for the string histogram (equi-width) has already been sent. I'll start
working on this one today and send a PR in the following days. Thanks!
> generate histogram information for column
> -----------------------------------------
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Subtask
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu
>
> We support two kinds of histograms:
> - Equi-width histogram: Each column interval in the histogram has the same fixed width.
The height of an interval represents the frequency of the column values that fall in that
interval, so the height varies from one interval to another. We
use the equi-width histogram when the number of distinct values is less than 254.
> - Equi-height histogram: The width of the column intervals varies, while the
heights of all column intervals are the same. The equi-height histogram is effective in handling
skewed data distributions. We use the equi-height histogram when the number of distinct values
is equal to or greater than 254.
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms (for both
numeric and string types) or the endpoints of equi-height histograms (for numeric types only).
Then, if we get the endpoints of an equi-height histogram, we need to compute the NDVs
(numbers of distinct values) between those
endpoints with [SPARK-17997] to form the equi-height histogram.
> This Jira incorporates the three Jiras mentioned above to support the needed aggregation functions.
We need to resolve them before this one.
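
The flow in the description above can be illustrated with a minimal sketch in plain Python. This is not Spark's implementation: all function names here are hypothetical, the endpoints are computed by exact sorting rather than with an approximate quantile summary, and the interval boundary convention (first interval closed on both sides, later ones open on the left) is an assumption. Only the 254 threshold comes from the description.

```python
NDV_THRESHOLD = 254  # cutoff stated in the issue description


def equi_width_histogram(values, num_bins):
    """Fixed-width intervals; the height (row count) varies per interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # fall back to 1 when all values are equal
    heights = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        heights[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, h) for i, h in enumerate(heights)]


def equi_height_endpoints(values, num_bins):
    """Exact quantile endpoints, so each interval holds roughly the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(i * n // num_bins, n - 1)] for i in range(num_bins + 1)]


def ndv_per_interval(values, endpoints):
    """Count distinct values in each interval between consecutive endpoints.

    Assumed convention: the first interval is [e0, e1], later ones are (e_i, e_{i+1}].
    """
    distinct = set(values)
    counts = []
    for i in range(len(endpoints) - 1):
        lo, hi = endpoints[i], endpoints[i + 1]
        counts.append(
            sum(1 for v in distinct if (lo <= v if i == 0 else lo < v) and v <= hi)
        )
    return counts


def build_histogram(values, num_bins):
    """Pick the histogram kind from the column's NDV, as the description states."""
    if len(set(values)) < NDV_THRESHOLD:
        return "equi-width", equi_width_histogram(values, num_bins)
    endpoints = equi_height_endpoints(values, num_bins)
    ndvs = ndv_per_interval(values, endpoints)
    return "equi-height", list(zip(endpoints, endpoints[1:], ndvs))
```

For example, build_histogram(list(range(300)), 4) takes the equi-height path because the column has 300 distinct values, while build_histogram([1, 2, 3, 4], 2) stays on the equi-width path.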

