Subject: Re: generateTreeString causes huge performance problems on dataframe persistence
From: Jan-Paul Bultmann <janpaulbultmann@me.com>
Date: Wed, 17 Jun 2015 18:36:51 +0200
To: "Cheng, Hao", User <user@spark.apache.org>

> Seems you're hitting the self-join; currently Spark SQL won't cache any result/logical tree for further analyzing or computing for self-join.

Other joins don't suffer from this problem?
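
For reference, each iteration I run is essentially a self-join of the previous result with itself, followed by a distinct and a persist. A minimal sketch of the pattern (the starting DataFrame and the src/dst column names are placeholders, not my actual job):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Each step self-joins the previous result, deduplicates, and caches it.
    def iterate(initial: DataFrame, steps: Int): DataFrame = {
      var result = initial
      for (_ <- 1 to steps) {
        result = result.as("l")
          .join(result.as("r"), col("l.dst") === col("r.src"))
          .select(col("l.src").as("src"), col("r.dst").as("dst"))
          .distinct()
        result.persist() // this persist() is where generateTreeString burns hours
      }
      result
    }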
> Since the logical tree is huge, it's reasonable to take a long time generating its tree string recursively.

The internal structure is basically a graph though, right? Equal cached subtrees are structurally shared by reference instead of being copied by value.
Is the `generateTreeString` result needed for anything other than giving the RDD a nice name?
It seems rather wasteful to compute a graph's unfolding into a tree just for that.

> And I also doubt the computing can finish within a reasonable time, as there will probably be lots of partitions (growing exponentially) of the intermediate result.

Possibly, though so far the number of partitions has stayed the same.
But I didn't run that many iterations, due to the problem :).

> As a workaround, you can break the iterations into smaller ones and trigger them manually in sequence.

You mean `write`-ing them to disk after each iteration? Roughly like the sketch below?
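
Just to make sure I understand the workaround, something like this (untested sketch; `step` stands for the self-join from the sketch above, the parquet path is a placeholder):

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Materialize an iteration and read it back, so the next step's logical
    // plan is just a parquet scan instead of the whole history of joins.
    def checkpointToDisk(sqlContext: SQLContext, df: DataFrame, path: String): DataFrame = {
      df.write.parquet(path)
      sqlContext.read.parquet(path)
    }

    // inside the loop:
    //   result = checkpointToDisk(sqlContext, step(result), s"/tmp/result_$i.parquet")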
Thanks :), Jan

> -----Original Message-----
> From: Jan-Paul Bultmann [mailto:janpaulbultmann@me.com]
> Sent: Wednesday, June 17, 2015 6:17 PM
> To: User
> Subject: generateTreeString causes huge performance problems on dataframe persistence
>
> Hey,
> I noticed that my code spends hours in `generateTreeString` even though the actual DAG/dataframe execution takes seconds.
>
> I'm running a query that grows exponentially in the number of iterations when evaluated without caching, but should be linear when caching previous results.
>
> E.g.
>
> result_i+1 = distinct(join(result_i, result_i))
>
> which evaluates exponentially like this without caching:
>
> Iteration | Dataframe Plan Tree
> 0         | /\
> 1         | /\ /\
> 2         | /\/\ /\/\
> n         | ……….
>
> but should be linear with caching:
>
> Iteration | Dataframe Plan Tree
> 0         | /\
>           | \/
> 1         | /\
>           | \/
> 2         | /\
>           | \/
> n         | ……….
>
> It seems that even though the DAG will have the latter form, `generateTreeString` walks the entire plan naively, as if no caching had been done.
>
> The Spark web UI also shows no active jobs, even though my CPU fully uses one core calculating that string.
>
> Below is the piece of stack trace that starts the entire walk.
>
>   ^
>   |
>   Thousands of calls to `generateTreeString`.
>   |
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(int, StringBuilder) TreeNode.scala:431
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString() TreeNode.scala:400
> org.apache.spark.sql.catalyst.trees.TreeNode.toString() TreeNode.scala:397
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$buildBuffers$2.apply() InMemoryColumnarTableScan.scala:164
> scala.Option.getOrElse(Function0) Option.scala:120
> org.apache.spark.sql.columnar.InMemoryRelation.buildBuffers() InMemoryColumnarTableScan.scala:164
> org.apache.spark.sql.columnar.InMemoryRelation.<init>(Seq, boolean, int, StorageLevel, SparkPlan, Option, RDD, Statistics, Accumulable) InMemoryColumnarTableScan.scala:112
> org.apache.spark.sql.columnar.InMemoryRelation$.apply(boolean, int, StorageLevel, SparkPlan, Option) InMemoryColumnarTableScan.scala:45
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply() CacheManager.scala:102
> org.apache.spark.sql.execution.CacheManager.writeLock(Function0) CacheManager.scala:70
> org.apache.spark.sql.execution.CacheManager.cacheQuery(DataFrame, Option, StorageLevel) CacheManager.scala:94
> org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320
>   ^
>   |
>   Application logic.
>   |
>
> Could someone confirm my suspicion?
> And does somebody know why it's called while caching, and why it walks the entire tree, including cached results?
>
> Cheers, Jan-Paul
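
PS: For anyone trying to reproduce this, a quick way to check that it really is the plan-string rendering (and not the job itself) that eats the time. A sketch only; `df` stands for the iterated/cached DataFrame from the loop above:

    import org.apache.spark.sql.DataFrame

    // Time just the rendering of the physical plan string; this goes through
    // the same treeString/generateTreeString walk seen in the stack trace.
    def timePlanString(df: DataFrame): Unit = {
      val t0 = System.nanoTime()
      val plan = df.queryExecution.executedPlan.toString
      val secs = (System.nanoTime() - t0) / 1e9
      println(f"plan string: ${plan.length} chars, rendered in $secs%.1f s")
    }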