spark-user mailing list archives

From "Lalwani, Jayesh" <Jayesh.Lalw...@capitalone.com>
Subject Re: smarter way to "forget" DataFrame definition and stick to its values
Date Wed, 02 May 2018 12:28:30 GMT
There is a trade-off involved here. If you have a Spark application with a complicated logical
graph, you can either cache data at certain points in the DAG or not cache it at all.
The side effect of caching data is higher memory usage; the side effect of not caching data
is higher CPU usage and, perhaps, IO. Ultimately, you can increase both memory and CPU by adding
more workers to your cluster, and adding workers costs money. So your caching choices are
reflected in the overall cost of running your application, and you need to do some analysis to
determine the caching configuration that will result in the lowest cost. Usually, being selective
about which DataFrames to cache strikes a good balance between memory usage and CPU usage.
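
For example, a minimal sketch of that kind of selective caching (the input paths, join key,
and column names below are placeholders, not from your job):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("selective-caching").getOrCreate()

  // hypothetical inputs; substitute your own sources
  val orders    = spark.read.parquet("s3://my-bucket/orders")
  val customers = spark.read.parquet("s3://my-bucket/customers")

  // this intermediate result is reused by several downstream branches,
  // so it is a good candidate for caching
  val enriched = orders.join(customers, "customer_id").cache()

  // each action below reuses the cached rows instead of re-running the join
  enriched.groupBy("country").count().write.parquet("s3://my-bucket/out/by_country")
  enriched.groupBy("order_month").count().write.parquet("s3://my-bucket/out/by_month")

  // release the memory once the downstream branches are done
  enriched.unpersist()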

I would not write data back to S3 and read it back in as a practice. Essentially, you are using
S3 as a "cache". However, reading and writing from S3 is not a scalable solution, because
it results in higher IO, and IO doesn't scale up as easily as CPU and memory. The only time
I would use S3 as a cache is when the cached data is in the terabyte+ range. If you are caching
gigabytes of data, then you are better off caching in memory. This is 2018: memory is cheap
but limited.
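
A minimal sketch of caching gigabyte-scale data in memory rather than round-tripping through S3
(the DataFrame below is just a stand-in, and spark is assumed to be your SparkSession):

  import org.apache.spark.storage.StorageLevel

  // stand-in for a gigabyte-scale intermediate result
  val df = spark.range(0, 100000000L).toDF("id")

  // keep it in executor memory, spilling to local disk only if it does not fit
  val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
  cached.count()   // an action materializes the cache

  // ... downstream stages reuse cached here ...

  cached.unpersist()   // free the memory when done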

From: Valery Khamenya <khamenya@gmail.com>
Date: Tuesday, May 1, 2018 at 9:17 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: smarter way to "forget" DataFrame definition and stick to its values

hi all

a short example before the long story:

  var accumulatedDataFrame = ... // initialize

  for (i <- 1 to 100) {
    val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
    // how to keep only the values of accumulatedDataFrame here and forget its definition?!
  }

This kind of thing is likely to get slower and slower on each iteration, even if myTinyNewData
is quite compact. Usually I write accumulatedDataFrame to S3 and then re-load it back to clear
the definition history. It makes the code ugly, though. Is there a smarter way?
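
In code, that write-and-reload workaround looks roughly like this (the bucket path is just a
placeholder):

  // write the accumulated result out and read it back, so the re-loaded
  // DataFrame carries none of the accumulated definition history
  accumulatedDataFrame.write.mode("overwrite").parquet("s3://my-bucket/tmp/accumulated")
  accumulatedDataFrame = spark.read.parquet("s3://my-bucket/tmp/accumulated")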

It happens very often that a DataFrame is created via complex definitions. The DataFrame is
then re-used in several places, and sometimes it gets recalculated, triggering a heavy cascade
of operations.

Of course, one could use the .persist or .cache modifiers, but the result is unfortunately not
transparent: instead of speeding things up, it can result in slowdowns or even lost jobs if
storage resources are not sufficient.

Any advice?

best regards
--
Valery