spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to massive query plans, which slows execution
Date Fri, 13 May 2016 19:37:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283052#comment-15283052 ]

Joseph K. Bradley commented on SPARK-13346:
-------------------------------------------

Sure, the practical applications are essentially every MLlib and GraphX algorithm.  To move any of those implementations to run on top of DataFrames, we will need this fixed.

For a concrete case with executable code, check out the BeliefPropagation example here: [https://github.com/graphframes/graphframes/blob/ac4a7c82dbde6529c98b3249a262cb958adaac43/src/main/scala/org/graphframes/examples/BeliefPropagation.scala]
It uses a hack, {{getCachedDataFrame}}, which converts the current iteration's DataFrame to an RDD, caches it, and converts it back to a DataFrame.  Without that workaround, the BP example dies after ~3 iterations on a tiny example graph, so the failure should be easy to reproduce.
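
For reference, the workaround boils down to something like the sketch below.  This is a minimal illustration of the technique, not the exact code in the linked file: round-tripping through an RDD discards the accumulated logical plan, so the returned DataFrame starts from a short lineage backed by the cached rows.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Sketch of the lineage-truncation hack: convert to an RDD, cache it, and
// rebuild a DataFrame over the cached rows so the query plan starts from scratch.
def getCachedDataFrame(df: DataFrame): DataFrame = {
  val rdd: RDD[Row] = df.rdd.cache()             // mark the row RDD for caching (materialized on first action)
  df.sqlContext.createDataFrame(rdd, df.schema)  // new DataFrame with a one-node logical plan
}
{code}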

Let me know if I can be of help in exploring the failures; I have other code snippets too.

> Using DataFrames iteratively leads to massive query plans, which slows execution
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-13346
>                 URL: https://issues.apache.org/jira/browse/SPARK-13346
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows very quickly
> with each iteration.  Caching the current DataFrame at the end of an iteration does not
> fix the problem.  However, converting the DataFrame to an RDD and back at the end of each
> iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to several hundred
> lines, to several thousand lines, ...) with successive iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the query plan
> does not need to be computed since it is already cached.  The computation on each iteration
> should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should be simple
> to see if you create an iterative algorithm which produces a new DataFrame from an old one
> on each iteration.
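
A toy version of the reproduction described above might look like the sketch below.  The column name, update rule, and iteration count are invented for illustration; in this simple loop the analyzed plan grows roughly linearly, while algorithms that also join the result against other DataFrames (as BP does) grow much faster.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Each iteration derives a new DataFrame from the previous one, so the analyzed
// logical plan keeps growing even though the per-iteration computation is identical.
object PlanGrowthRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-13346-repro").getOrCreate()

    var df: DataFrame = spark.range(0, 1000).toDF("x")
    for (i <- 1 to 8) {
      df = df.select((col("x") + 1).as("x")).cache()   // cache() alone does not shorten the plan
      val planLines = df.queryExecution.analyzed.toString.split("\n").length
      println(s"iteration $i: analyzed plan has $planLines lines")
      // The RDD round-trip from the comment above would truncate the plan here:
      // df = spark.createDataFrame(df.rdd.cache(), df.schema)
    }
    spark.stop()
  }
}
{code}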



