spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17074) generate histogram information for column
Date Sat, 22 Oct 2016 13:26:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15597858#comment-15597858 ]

Herman van Hovell commented on SPARK-17074:
-------------------------------------------

[~ZenWzh] I think your current approach is valid. It will take two passes, but that is fine
for now.

I have discussed this with Tim, and we are going to see if we can come up with a
single-pass algorithm. But that will not happen until sometime next week.

Please also note that we are currently doing some work on the aggregation code paths. This
might make your effort a little easier.

> generate histogram information for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu
>
> We support two kinds of histograms:
> - Equi-width histogram: Each column interval in the histogram has a fixed width. The
> height of an interval represents the frequency of the column values that fall in it, so
> the height varies from interval to interval. We use the equi-width histogram when the
> number of distinct values is less than 254.
> - Equi-height histogram: The width of the column intervals varies, while the heights of
> all intervals are the same. The equi-height histogram is effective in handling skewed
> data distributions. We use the equi-height histogram when the number of distinct values
> is equal to or greater than 254.
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms (for both
> numeric and string types) or the endpoints of equi-height histograms (for numeric type only).
> Then, if we get the endpoints of an equi-height histogram, we need to compute the ndv's
> (numbers of distinct values) between those endpoints with [SPARK-17997] to form the
> equi-height histogram.
> This Jira incorporates the three Jiras mentioned above to support the needed aggregation
> functions. We need to resolve them before this one.
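
For illustration only, the histogram selection and the two-pass equi-height construction described in the issue can be sketched in plain Python. This is not the Spark implementation: the function names, the `num_bins` parameter, and the exact sort-based choice of endpoints are assumptions made for the example, and only numeric values are handled. The 254 cutoff is the one quoted in the description.

```python
from bisect import bisect_left

NDV_THRESHOLD = 254  # cutoff quoted in the issue description


def equi_width_histogram(values, num_bins):
    """Fixed-width intervals; the height (row count) varies per interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 when all values are equal
    heights = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        heights[idx] += 1
    return lo, width, heights


def equi_height_endpoints(values, num_bins):
    """Pass 1: pick endpoints so each interval holds roughly the same number of rows."""
    ordered = sorted(values)  # exact quantiles; an assumption made for this sketch
    n = len(ordered)
    return [ordered[min(i * n // num_bins, n - 1)] for i in range(num_bins + 1)]


def ndv_per_interval(values, endpoints):
    """Pass 2: count the distinct values that fall in each interval."""
    distinct = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        # Interval i covers (endpoints[i], endpoints[i + 1]]; the minimum goes to interval 0.
        idx = max(bisect_left(endpoints, v) - 1, 0)
        distinct[idx].add(v)
    return [len(s) for s in distinct]


def build_histogram(values, num_bins=4):
    """Choose the histogram kind from the number of distinct values (ndv)."""
    ndv = len(set(values))
    if ndv < NDV_THRESHOLD:
        return "equi-width", equi_width_histogram(values, num_bins)
    endpoints = equi_height_endpoints(values, num_bins)
    return "equi-height", (endpoints, ndv_per_interval(values, endpoints))
```

Calling `build_histogram(list(range(1000)))` takes the equi-height branch because the column has 1000 distinct values; the endpoints come from the first pass over the data and the per-interval ndv's from the second, which mirrors the two passes mentioned in the comment above.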




