spark-user mailing list archives

From Jan-Paul Bultmann
Subject generateTreeString causes huge performance problems on dataframe persistence
Date Wed, 17 Jun 2015 10:17:10 GMT
I noticed that my code spends hours in `generateTreeString`, even though the actual DAG/DataFrame
execution takes seconds.

I’m running a query that grows exponentially in the number of iterations when evaluated without
caching, but should be linear when previous results are cached.


    result_i+1 = distinct(join(result_i, result_i))

Without caching, this evaluates exponentially, like this:

Iteration | Dataframe Plan Tree
0            |        /\
1            |     /\    /\
2            |    /\/\  /\/\
n            |    ……….

With caching, it should be linear instead:

Iteration | Dataframe Plan Tree
0            |     /\
              |     \/
1            |     /\
              |     \/
2            |     /\
              |     \/
n            | ……….

It seems that even though the DAG will have the latter form, `generateTreeString` will walk
the entire plan naively, as if no caching was done.
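
As a rough illustration of the suspected behaviour, here is a toy model in plain Python. It is not Spark's actual `TreeNode` code: `PlanNode`, `naive_tree_string`, `count_unique_nodes`, `build_plan` and `lines_and_nodes` are hypothetical names made up for this sketch. It only shows that a recursive string rendering which does not track already-visited nodes produces exponentially many lines for a plan whose number of distinct nodes grows linearly.

```python
class PlanNode:
    """Hypothetical stand-in for a query plan node (not Spark's TreeNode)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)


def naive_tree_string(node, depth=0, out=None):
    """Render one line per visit, re-walking shared subplans every time."""
    if out is None:
        out = []
    out.append(" " * depth + node.name)
    for child in node.children:
        naive_tree_string(child, depth + 1, out)
    return out


def count_unique_nodes(node, seen=None):
    """Count distinct node objects, visiting each shared subplan only once."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_unique_nodes(child, seen) for child in node.children)


def build_plan(iterations):
    """Model result_i+1 = distinct(join(result_i, result_i)) as a shared DAG."""
    result = PlanNode("base")
    for i in range(iterations):
        join = PlanNode(f"join_{i}", [result, result])  # same object on both sides
        result = PlanNode(f"distinct_{i}", [join])
    return result


def lines_and_nodes(iterations):
    """Return (lines printed by the naive walk, number of distinct nodes)."""
    plan = build_plan(iterations)
    return len(naive_tree_string(plan)), count_unique_nodes(plan)


if __name__ == "__main__":
    for n in (0, 3, 10, 20):
        print(n, lines_and_nodes(n))
```

In this model the naive walk emits 3 * 2^n - 2 lines after n iterations, while the number of distinct nodes is only 2n + 1, which would match the observation of a single core being busy for hours on a plan that executes in seconds.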

The Spark web UI also shows no active jobs, even though one CPU core is fully busy calculating
that string.

Below is the part of the stack trace that starts the entire walk.

Thousands of calls to `generateTreeString`.
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, StringBuilder) TreeNode.scala:431
org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
scala.Option.getOrElse(Function0) Option.scala:120
org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() InMemoryColumnarTableScan.scala:164
org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) InMemoryColumnarTableScan.scala:112
org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() CacheManager.scala:102
org.apache.spark.sql.execution.CacheManager.writeLock(Function0) CacheManager.scala:70
org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, StorageLevel) CacheManager.scala:94
org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320
Application logic.

Could someone confirm my suspicion?
And does somebody know why it’s called while caching, and why it walks the entire tree including
cached results?

Cheers, Jan-Paul