spark-user mailing list archives

From Jan-Paul Bultmann <janpaulbultm...@me.com>
Subject Re: generateTreeString causes huge performance problems on dataframe persistence
Date Wed, 17 Jun 2015 16:36:51 GMT

> Seems you're hitting the self-join, currently Spark SQL won't cache any result/logical tree for further analyzing or computing for self-join.

Other joins don’t suffer from this problem?

> Since the logical tree is huge, it's reasonable to take long time in generating its tree string recursively.

The internal structure is basically a graph though, right?
One where equal cached subtrees are structurally shared by reference instead of being copied by value.

Is the `generateTreeString` result needed for anything other than giving the RDD a nice name?
It seems rather wasteful to compute a graph's unfolding into a tree for this.
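
To illustrate the difference between the shared graph and its unfolding, here is a minimal toy model in plain Python. It is not Spark code: `Plan`, `self_join_chain`, `unfolded_size` and `shared_size` are hypothetical names made up for this sketch, and it only assumes that each iteration references the previous result twice, as in the self-join described in this thread.

```python
class Plan:
    """Toy stand-in for a plan node (hypothetical, not a Spark class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)


def self_join_chain(iterations):
    """Build result_{i+1} = distinct(join(result_i, result_i)),
    sharing result_i by reference on both sides of the join."""
    result = Plan("scan")
    for _ in range(iterations):
        join = Plan("join", (result, result))
        result = Plan("distinct", (join,))
    return result


def unfolded_size(plan):
    """Nodes visited by a naive recursive walk, i.e. the size of the
    tree that a recursive tree-string printer would have to emit."""
    return 1 + sum(unfolded_size(child) for child in plan.children)


def shared_size(plan, seen=None):
    """Distinct node objects reachable from plan, i.e. the size of the
    underlying graph when shared subtrees are counted once."""
    seen = set() if seen is None else seen
    if id(plan) in seen:
        return 0
    seen.add(id(plan))
    return 1 + sum(shared_size(child, seen) for child in plan.children)


if __name__ == "__main__":
    plan = self_join_chain(10)
    print(unfolded_size(plan), shared_size(plan))
```

In this model the naive walk visits 3 * 2^n - 2 nodes after n iterations, while the graph itself only has 2n + 1 nodes, so 10 iterations already give 3070 versus 21.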

> And I also doubt the computing can finish within a reasonable time, as there probably be lots of partitions (grows exponentially) of the intermediate result.
> 

Possibly, so far the number of partitions stayed the same though.
But I didn’t run that many iterations due to the problem :).

> As a workaround, you can break the iterations into smaller ones and trigger them manually in sequence.

You mean `write`-ing them to disk after each iteration?
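
As a rough sketch of why that would help, the toy model below (plain Python, not Spark code) compares the unfolded plan size per iteration with and without materializing the result. It assumes that a result written out and read back shows up in the next plan as a single fresh leaf; `Plan`, `iterate`, `materialize`, `tree_size` and `plan_sizes` are hypothetical names used only for this illustration.

```python
class Plan:
    """Toy stand-in for a plan node (hypothetical, not a Spark class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)


def iterate(result):
    """One step of result_{i+1} = distinct(join(result_i, result_i))."""
    return Plan("distinct", (Plan("join", (result, result)),))


def materialize(plan):
    """Assumption: writing the result out and reading it back yields a
    plan that is a single leaf, with no history attached."""
    return Plan("scan")


def tree_size(plan):
    """Nodes visited by a naive recursive walk over the plan."""
    return 1 + sum(tree_size(child) for child in plan.children)


def plan_sizes(iterations, materialize_each):
    """Unfolded plan size seen at each iteration, before materializing."""
    sizes = []
    result = Plan("scan")
    for _ in range(iterations):
        result = iterate(result)
        sizes.append(tree_size(result))
        if materialize_each:
            result = materialize(result)
    return sizes


if __name__ == "__main__":
    print(plan_sizes(4, materialize_each=False))
    print(plan_sizes(4, materialize_each=True))
```

Under that assumption the walk stays at 4 nodes per iteration when each result is materialized, instead of growing as 4, 10, 22, 46 and so on.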

Thanks :), Jan

> -----Original Message-----
> From: Jan-Paul Bultmann [mailto:janpaulbultmann@me.com] 
> Sent: Wednesday, June 17, 2015 6:17 PM
> To: User
> Subject: generateTreeString causes huge performance problems on dataframe persistence
> 
> Hey,
> I noticed that my code spends hours in `generateTreeString` even though the actual DAG/DataFrame execution takes seconds.
> 
> I’m running a query that grows exponentially in the number of iterations when evaluated without caching, but should be linear when caching previous results.
> 
> E.g.
> 
>    result_i+1 = distinct(join(result_i, result_i))
> 
> Which evaluates exponentially like this without caching.
> 
> Iteration | Dataframe Plan Tree
> 0            |        /\
> 1            |     /\    /\
> 2            |    /\/\  /\/\
> n            |    ……….
> 
> But should be linear with caching.
> 
> Iteration | Dataframe Plan Tree
> 0            |     /\
>              |     \/
> 1            |     /\
>              |     \/
> 2            |     /\
>              |     \/
> n            | ……….
> 
> 
> It seems that even though the DAG will have the latter form, `generateTreeString` will walk the entire plan naively as if no caching was done.
> 
> The Spark web UI also shows no active jobs even though my CPU uses one core fully, calculating that string.
> 
> Below is the piece of stacktrace that starts the entire walk.
> 
> ^
> |
> Thousands of calls to  `generateTreeString`.
> |
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, StringBuilder) TreeNode.scala:431
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
> org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> scala.Option.getOrElse(Function0) Option.scala:120
> org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) InMemoryColumnarTableScan.scala:112
> org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() CacheManager.scala:102
> org.apache.spark.sql.execution.CacheManager.writeLock(Function0) CacheManager.scala:70
> org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, StorageLevel) CacheManager.scala:94
> org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320
> ^
> |
> Application logic.
> |
> 
> Could someone confirm my suspicion?
> And does somebody know why it’s called while caching, and why it walks the entire tree including cached results?
> 
> Cheers, Jan-Paul
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org For additional commands, e-mail:
user-help@spark.apache.org
> 

