spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-13749) Faster pivot implementation for many distinct values with two phase aggregation
Date Tue, 03 May 2016 05:49:12 GMT

     [ https://issues.apache.org/jira/browse/SPARK-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-13749.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12861
[https://github.com/apache/spark/pull/12861]

> Faster pivot implementation for many distinct values with two phase aggregation
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-13749
>                 URL: https://issues.apache.org/jira/browse/SPARK-13749
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Andrew Ray
>            Assignee: Andrew Ray
>             Fix For: 2.0.0
>
>
> The existing implementation of pivot translates into a single aggregation with one aggregate
> per distinct pivot value. When the number of distinct pivot values is large (say, 1000+), this
> can get extremely slow, since each input value is evaluated on every aggregate even though
> it only affects the value of one of them.
> I'm proposing an alternative strategy for when there are 10+ (a somewhat arbitrary threshold)
> distinct pivot values. We do two phases of aggregation. In the first, we group by the grouping
> columns plus the pivot column and perform the specified aggregations (one or sometimes more).
> In the second, we group by the grouping columns only and use the new (non-public) PivotFirst
> aggregate, which rearranges the outputs of the first aggregation into an array indexed by the
> pivot value. Finally, we apply a projection to extract the array entries into the appropriate
> output columns.
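
The two-phase strategy described above can be illustrated with a minimal sketch in plain Python. This is not Spark code and not the PivotFirst implementation; the function name two_phase_pivot, the (group_key, pivot_value, amount) row layout, and the choice of sum as the aggregation are all assumptions made for illustration only.

```python
from collections import defaultdict


def two_phase_pivot(rows, pivot_values):
    """Hypothetical helper: rows are (group_key, pivot_value, amount) tuples."""
    # Phase 1: group by (grouping column + pivot column) and aggregate.
    # Each input row touches exactly one partial aggregate.
    partial = defaultdict(int)
    for group_key, pivot_value, amount in rows:
        partial[(group_key, pivot_value)] += amount

    # Phase 2: group by the grouping column only and place each partial
    # result into an array slot indexed by its pivot value. This mimics the
    # role the issue assigns to PivotFirst; it is not that aggregate's code.
    index_of = {value: i for i, value in enumerate(pivot_values)}
    arrays = {}
    for (group_key, pivot_value), total in partial.items():
        slots = arrays.setdefault(group_key, [None] * len(pivot_values))
        if pivot_value in index_of:
            slots[index_of[pivot_value]] = total

    # Final projection: extract array entries into named output columns.
    return {
        group_key: dict(zip(pivot_values, slots))
        for group_key, slots in arrays.items()
    }


# Made-up sample data: (year, language, amount).
rows = [
    (2015, "java", 10),
    (2015, "scala", 5),
    (2015, "java", 7),
    (2016, "scala", 3),
]
result = two_phase_pivot(rows, ["java", "scala"])
print(result)
```

In this sketch the first phase does one update per input row regardless of how many distinct pivot values exist, whereas a single aggregation with one aggregate per pivot value would evaluate every row against every aggregate.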



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


