spark-user mailing list archives

From Valery Khamenya <khame...@gmail.com>
Subject smarter way to "forget" DataFrame definition and stick to its values
Date Tue, 01 May 2018 13:16:23 GMT
hi all

a short example before the long story:

  var accumulatedDataFrame = ... // initialize

  for (i <- 1 to 100) {
    // my slowly calculated new data portion, in tiny amounts
    val myTinyNewData = ...
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
    // how to stick to the values of accumulatedDataFrame only here
    // and forget the definitions?!
  }

This kind of code is likely to get slower and slower on each iteration,
even if myTinyNewData is quite compact. Usually I write accumulatedDataFrame
to S3 and then re-load it to clear the definition history. It makes the
code ugly, though. Is there a smarter way?
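
The write-and-reload workaround mentioned above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the original message: Parquet as the storage format, a hypothetical helper named roundTrip, a two-column stand-in schema, a local master, and a local temporary directory as the default output location (an S3 path such as a hypothetical s3a://my-bucket/tmp/accumulated could be passed as the first argument instead). Each iteration writes to its own sub-path, because overwriting a location that the same DataFrame is still reading from would fail.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object ForgetDefinitionSketch {

  // Hypothetical helper: write df out as Parquet and read it back, so the
  // returned DataFrame is defined only by a scan of the files at `path`.
  def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for trying the sketch out on one machine.
    val spark = SparkSession.builder()
      .appName("forget-definition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: base output location; defaults to a local temp directory.
    val basePath = args.headOption
      .getOrElse(Files.createTempDirectory("accumulated").toString)

    // Stand-in for the real initialization of accumulatedDataFrame.
    var accumulatedDataFrame = Seq.empty[(Int, String)].toDF("id", "value")

    for (i <- 1 to 100) {
      // Stand-in for the slowly calculated tiny data portion.
      val myTinyNewData = Seq((i, s"row-$i")).toDF("id", "value")

      // A fresh sub-path per iteration avoids overwriting files that the
      // current accumulatedDataFrame is still being read from.
      accumulatedDataFrame = roundTrip(
        spark,
        accumulatedDataFrame.union(myTinyNewData),
        s"$basePath/iter-$i")
    }

    spark.stop()
  }
}
```

Older iter-N directories are not cleaned up in this sketch, which is part of what makes the approach feel ugly.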

It happens very often that a DataFrame is created via complex definitions.
The DataFrame is then re-used in several places, and sometimes it gets
recalculated, triggering a heavy cascade of operations.

Of course one could use .persist or .cache, but the result is
unfortunately not transparent: instead of speeding things up, it can cause
a slow-down or even lost jobs if storage resources are insufficient.

Any advice?

best regards
--
Valery
