spark-user mailing list archives

From Valery Khamenya <>
Subject smarter way to "forget" DataFrame definition and stick to its values
Date Tue, 01 May 2018 13:16:23 GMT
hi all

a short example before the long story:

  var accumulatedDataFrame = ... // initialize

  for (i <- 1 to 100) {
    // my slowly calculated new data portion, in tiny amounts
    val myTinyNewData = ...
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
    // how to keep only the values of accumulatedDataFrame here and
    // forget its definition?!
  }

This kind of code is likely to get slower on each iteration, even if
myTinyNewData is quite compact, because the definition behind
accumulatedDataFrame keeps growing. Usually I write accumulatedDataFrame
to S3 and then reload it to clear the definition history. That makes the
code ugly, though. Is there a smarter way?
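
For reference, the S3 round trip I mean looks roughly like the sketch
below. The object name, the helper roundTrip, the Parquet format and the
path layout are all made up for this example; only the write-then-reload
idea is what I actually do.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LineageReset {
  // Hypothetical helper: write df out and read it back, so the returned
  // DataFrame is defined by a plain file scan rather than by the whole
  // chain of unions that produced it.
  def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }
}

// Usage inside the loop (sketch only; basePath is an assumed location).
// A fresh path per iteration avoids overwriting the files that the
// current accumulatedDataFrame is still being read from:
//
//   accumulatedDataFrame = LineageReset.roundTrip(
//     spark,
//     accumulatedDataFrame.union(myTinyNewData),
//     s"$basePath/iter-$i")
```

It works, but every iteration pays for a full write and read, and the
old per-iteration directories have to be cleaned up by hand.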

It happens very often that a DataFrame is created via complex definitions.
The DataFrame is then reused in several places, and sometimes it gets
recalculated, triggering a heavy cascade of operations.

Of course one could use .persist or .cache, but the effect is
unfortunately not transparent: instead of speeding things up, it can cause
a slowdown or even lost jobs if storage resources are insufficient.

Any advice?

best regards
