spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raviteja Lokineni <raviteja.lokin...@gmail.com>
Subject Re: Aggregations on every column on dataframe causing StackOverflowError
Date Wed, 09 Nov 2016 18:06:27 GMT
Does this stacktrace look like a bug guys? Definitely seems like one to me.

Caused by: java.lang.StackOverflowError
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)


On Wed, Nov 9, 2016 at 10:48 AM, Raviteja Lokineni <
raviteja.lokineni@gmail.com> wrote:

> Hi all,
>
> I am not sure if this is a bug or not. Basically I am generating weekly
> aggregates of every column of data.
>
> Adding source code here (also attached):
>
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
>
> timeSeries = sqlContext.read.option("header", "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
>
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
>
> w = (Window()
>      .partitionBy("id")
>      .orderBy(col("dt").cast("timestamp").cast("long"))
>      .rangeBetween(-days(6), 0))
>
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
>
> for col in timeSeries.columns:
>     if col in skipCols:
>         continue
>     cols.append(mean(col).over(w).alias("mean_7_"+col))
>     cols.append(count(col).over(w).alias("count_7_"+col))
>     cols.append(sum(col).over(w).alias("sum_7_"+col))
>     cols.append(min(col).over(w).alias("min_7_"+col))
>     cols.append(max(col).over(w).alias("max_7_"+col))
>
> df = timeSeries.select(cols)
> df.orderBy('id', 'dt').write\
>     .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")\
>     .save("file:///tmp/spark-bug-out.csv")
>
>
> Thanks,
> --
> *Raviteja Lokineni* | Business Intelligence Developer
> TD Ameritrade
>
> E: raviteja.lokineni@gmail.com
>
> [image: View Raviteja Lokineni's profile on LinkedIn]
> <http://in.linkedin.com/in/ravitejalokineni>
>
>


-- 
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokineni@gmail.com

[image: View Raviteja Lokineni's profile on LinkedIn]
<http://in.linkedin.com/in/ravitejalokineni>

Mime
View raw message