Hi all,
A short example before the long story:
var accumulatedDataFrame = ... // initialize
for (i <- 1 to 100) {
  // my slowly calculated new data portion, in tiny amounts
  val myTinyNewData = ...
  accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
  // How can I keep only the values of accumulatedDataFrame here and
  // forget the definitions that produced them?
}
This kind of code is likely to get slower and slower with each iteration,
even if myTinyNewData is quite compact. I usually write accumulatedDataFrame
to S3 and then reload it to clear the definition history, but that makes the
code ugly. Is there a smarter way?
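The write-and-reload workaround mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name roundTrip, the Parquet format, and the basePath variable are made up here and are not part of the original code. Each iteration writes to a fresh path, because overwriting a location that the DataFrame itself still reads from would fail.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (name and Parquet format are assumptions):
// materialize the DataFrame to storage and read it back, so that the
// returned DataFrame's plan is a plain file scan rather than the full
// chain of unions that produced it.
def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)
}

// Possible use inside the loop, with basePath as a placeholder location:
//   accumulatedDataFrame = roundTrip(
//     spark,
//     accumulatedDataFrame.union(myTinyNewData),
//     s"$basePath/step-$i")
```

Older step directories are left behind in this sketch and would need to be cleaned up separately.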
It happens very often that a DataFrame is created via complex definitions.
The DataFrame is then reused in several places, and sometimes it gets
recalculated, triggering a heavy cascade of operations.
Of course, one could use .persist or .cache, but the result is
unfortunately not transparent: instead of speeding things up, it can cause
a slowdown or even lost jobs if storage resources are insufficient.
Any advice?
Best regards,
--
Valery