spark-issues mailing list archives

From "Zhenhua Wang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-18000) Aggregation function for computing endpoints for numeric histograms
Date Wed, 19 Oct 2016 02:53:58 GMT
Zhenhua Wang created SPARK-18000:
------------------------------------

             Summary: Aggregation function for computing endpoints for numeric histograms
                 Key: SPARK-18000
                 URL: https://issues.apache.org/jira/browse/SPARK-18000
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Zhenhua Wang


For a column of numeric type (including date and timestamp), we will generate an equi-width
or equi-height histogram, depending on whether its number of distinct values (ndv) is larger
than the maximum number of bins allowed in one histogram (denoted as numBins).
This aggregate function computes values and their frequencies using a small hashmap, whose size
is less than or equal to "numBins", and returns an equi-width histogram.
When the size of the hashmap exceeds "numBins", it clears the hashmap and uses ApproximatePercentile
to return the endpoints of an equi-height histogram.
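
The behaviour described above can be sketched in plain Python as follows. This is only
an illustration under stated assumptions, not the proposed Spark implementation: the
function name histogram_endpoints and the returned mode labels are hypothetical, and exact
percentiles over sorted data stand in for ApproximatePercentile.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple, Union

Number = Union[int, float]
Result = Tuple[str, Union[Dict[Number, int], List[Number]]]


def histogram_endpoints(values: Sequence[Number], num_bins: int) -> Result:
    """Hypothetical helper illustrating the two modes described in the issue."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")

    counts: Counter = Counter()
    overflowed = False
    for v in values:
        counts[v] += 1
        if len(counts) > num_bins:
            # More distinct values than bins: drop the map and switch modes.
            counts.clear()
            overflowed = True
            break

    if not overflowed:
        # ndv <= num_bins: return each distinct value with its frequency
        # (the mode the issue calls an equi-width histogram).
        return "equi_width", dict(sorted(counts.items()))

    # Assumption: exact percentiles on sorted data replace ApproximatePercentile.
    ordered = sorted(values)
    n = len(ordered)
    # num_bins + 1 endpoints: the minimum, the interior cut points, the maximum.
    endpoints = [ordered[min(n - 1, (i * n) // num_bins)] for i in range(num_bins + 1)]
    return "equi_height", endpoints
```

The sketch reads the input twice when it falls back to percentiles. A real aggregate
would presumably feed a percentile summary in the same pass, which is why an approximate
structure is attractive there.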


