From issues-return-135037-apmail-spark-issues-archive=spark.apache.org@spark.apache.org Wed Oct 19 03:11:59 2016
Date: Wed, 19 Oct 2016 03:11:59 +0000 (UTC)
From: "Zhenhua Wang (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-17074) generate histogram information for
column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhenhua Wang updated SPARK-17074:
---------------------------------
Description:
We support two kinds of histograms:
- Equi-width histogram: Each column interval in the histogram has a fixed width. The height of an interval represents the frequency of the column values that fall in that interval, so the height varies from one interval to another. We use the equi-width histogram when the number of distinct values is less than 254.
- Equi-height histogram: The width of each column interval varies, while the heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distributions. We use the equi-height histogram when the number of distinct values is equal to or greater than 254.
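The choice between the two kinds, and the shape of an equi-width histogram, can be sketched in a few lines of Python. This is only an illustration of the description above, not Spark code: the function names and the `num_bins` parameter are hypothetical, and only the threshold of 254 comes from the description.

```python
from typing import List, Tuple

NDV_THRESHOLD = 254  # threshold stated in the description


def choose_histogram_kind(values: List[float]) -> str:
    """Pick the histogram kind from the number of distinct values."""
    ndv = len(set(values))
    return "equi-width" if ndv < NDV_THRESHOLD else "equi-height"


def equi_width_histogram(values: List[float], num_bins: int) -> List[Tuple[float, float, int]]:
    """Return (low, high, count) per interval; every interval has the same width."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are equal (hypothetical choice).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value falls into the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(num_bins)]
```

With skewed data most rows land in one interval of the equi-width histogram, which is why the equi-height kind is preferred once the number of distinct values is large.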
We first use [SPARK-18000] and [SPARK-17881] to compute either equi-width histograms (for both numeric and string types) or the endpoints of equi-height histograms (for numeric types only). Then, if we get the endpoints of an equi-height histogram, we need to compute the number of distinct values (ndv) between each pair of adjacent endpoints using [SPARK-17997] to form the equi-height histogram.
This Jira incorporates the three Jiras mentioned above, which provide the needed aggregation functions. We need to resolve them before this one.
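The two-step construction of an equi-height histogram described above can be sketched as follows. This is a minimal in-memory illustration, not the aggregation functions from the referenced Jiras: the function names are hypothetical, the endpoints are computed exactly from sorted data, and the convention for which side of an interval is closed is an assumption.

```python
from typing import List, Tuple


def equi_height_endpoints(values: List[float], num_bins: int) -> List[float]:
    """Step 1: endpoints such that each interval holds about the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    # Endpoint i is the value at position i * n / num_bins in the sorted data.
    return [ordered[min(i * n // num_bins, n - 1)] for i in range(num_bins + 1)]


def ndv_per_interval(values: List[float], endpoints: List[float]) -> List[Tuple[float, float, int]]:
    """Step 2: count the distinct values between each pair of adjacent endpoints."""
    distinct = set(values)
    result = []
    for i in range(len(endpoints) - 1):
        low, high = endpoints[i], endpoints[i + 1]
        # Assumed convention: the first interval is closed on both sides,
        # later intervals exclude their lower endpoint.
        if i == 0:
            in_bin = {v for v in distinct if low <= v <= high}
        else:
            in_bin = {v for v in distinct if low < v <= high}
        result.append((low, high, len(in_bin)))
    return result
```

For a skewed column such as `[1, 1, 1, 1, 2, 3, 4, 100]` with four intervals, the frequent value 1 occupies a narrow interval of its own, while the rare large value stretches the last interval.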
was:
We support two kinds of histograms:
- Equi-width histogram: We have a fixed width for each column interval in the histogram. The height of a histogram represents the frequency for those column values in a specific interval. For this kind of histogram, its height varies for different column intervals. We use the equi-width histogram when the number of distinct values is less than 254.
- Equi-height histogram: For this histogram, the width of column interval varies. The heights of all column intervals are the same. The equi-height histogram is effective in handling skewed data distribution. We use the equi-height histogram when the number of distinct values is equal to or greater than 254.
> generate histogram information for column
> -----------------------------------------
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu