spark-issues mailing list archives

From "Dominic Ricard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function
Date Tue, 07 Feb 2017 15:48:42 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856229#comment-15856229
] 

Dominic Ricard commented on SPARK-12835:
----------------------------------------

I ran into the same issue today when trying to run:

_SQL_
{noformat}
select
  sum(row_number() OVER (partition by column1 order by column2))
from
  (select 1 as `column1`, 1 as `column2`) t
{noformat}

The actual use case is a bit more involved than this, but this is the minimum required to reproduce
the issue.

The workaround I found was to compute the row_number in a subselect, then do the sum in the outer
select:
{noformat}
select sum(tt.row_num) from 
(select
  row_number() OVER (partition by column1 order by column2) as `row_num`
from
  (select 1 as `column1`, 1 as `column2`) t) tt
{noformat}
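For what it's worth, the rewritten query can be sanity-checked outside Spark with any SQL engine that supports window functions. The sketch below uses Python's built-in sqlite3 module (window functions need SQLite 3.25+, bundled with recent Python builds) purely as an illustration of the subselect workaround, not of Spark itself:

```python
# Sanity check of the subselect workaround using SQLite, not Spark.
# Requires SQLite 3.25+ for window-function support.
import sqlite3

con = sqlite3.connect(":memory:")
query = """
select sum(tt.row_num) from
  (select
     row_number() over (partition by column1 order by column2) as row_num
   from
     (select 1 as column1, 1 as column2) t) tt
"""
# The inner query yields a single row, so row_number() is 1 and the sum is 1.
result = con.execute(query).fetchone()[0]
print(result)  # 1
```

The key point is the same in both engines: the window function is fully evaluated in the inner query, so the outer aggregate only ever sees a plain column.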


> StackOverflowError when aggregating over column from window function
> --------------------------------------------------------------------
>
>                 Key: SPARK-12835
>                 URL: https://issues.apache.org/jira/browse/SPARK-12835
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try to directly
> aggregate on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a DataFrame column
> by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
>     [datetime(2014,1,1)],
>     [datetime(2014,2,1)],
>     [datetime(2014,3,1)],
>     [datetime(2014,3,6)],
>     [datetime(2014,8,23)],
>     [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>    
> w = Window.orderBy(ts)
> diff = functions.datediff(
>     ts,
>     functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
>     +----+
>     |diff|
>     +----+
>     |null|
>     |  31|
>     |  28|
>     |   5|
>     | 170|
>     |  39|
>     +----+
> {noformat}
> doing
> {code}
> df.select(avg_diff).show()
> {code}
> throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

