spark-user mailing list archives

From Matei Zaharia <>
Subject Re: Lazy evaluation of RDD data transformation
Date Tue, 21 Jan 2014 19:55:32 GMT
If you don’t cache the RDD, the computation will be repeated each time you scan through
it. Spark does this to save memory, and because it can’t know up front whether you plan
to access a dataset multiple times. To avoid the recomputation, call cache(), or
persist(StorageLevel.DISK_ONLY) if you don’t want to keep the data in memory.
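
As a rough sketch of the two options (not tested against your data): the object name
CacheSketch, the parseLabel helper, and the command-line arguments are made up for
illustration, and the records are kept as plain tuples instead of LabeledPoint to keep
the example independent of mllib.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Hypothetical helper mirroring the label mapping in the quoted code.
  def parseLabel(s: String): Double = s match {
    case "1" => 0.0
    case "2" => 1.0
    case _   => 2.0
  }

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the input path; an optional args(1) == "disk"
    // selects disk-only persistence instead of in-memory caching.
    val path = args(0)
    val useDisk = args.length > 1 && args(1) == "disk"

    val conf = new SparkConf().setAppName("CacheSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Nothing is read or parsed yet: textFile and map only record the lineage.
      val parsed = sc.textFile(path).map { line =>
        val temp = line.split(",")
        (parseLabel(temp(3)), temp.slice(1, 3).map(_.toDouble))
      }

      // Mark the RDD for reuse. This is also lazy; data is stored by the first action.
      if (useDisk) parsed.persist(StorageLevel.DISK_ONLY) else parsed.cache()

      val first = parsed.count()  // reads the file, runs the map, stores the partitions
      val second = parsed.count() // reuses stored partitions where they are still available
      println(s"first=$first second=$second")
    } finally sc.stop()
  }
}
```

Without the cache()/persist() call, both count() calls would re-read the file and re-run
the map. With cache(), partitions that don’t fit in memory are still recomputed on demand.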


On Jan 21, 2014, at 11:32 AM, DB Tsai <> wrote:

> Hi,
> The data is read from HDFS using textFile, and then a map function is applied, as in
> the following code, to get the format right for feeding it into the mllib training algorithms.
> val rddFile = sc.textFile("Some file on HDFS")
> val rddData = rddFile.map(line => {
>       val temp = line.toString.split(",")
>       val y = temp(3) match {
>         case "1" => 0.0
>         case "2" => 1.0
>         case _ => 2.0
>       }
>       val x = temp.slice(1, 3).map(_.toDouble)
>       LabeledPoint(y, x)
> })
> My question is: when is the map function performed? Is it lazily evaluated the first time
> we use rddData, generating a new dataset called rddData since RDDs are immutable?
> Does that mean the transformation isn't recomputed the second time we use rddData?
> Or is the transformation computed on the fly each time, so that we don't need extra memory for it?
> The motivation for asking is that I found the mllib library performs lots of extra
> transformations. For example, the intercept is added by
> map(point => new LabeledPoint(point.y, Array(1, point.feature)))
> If a new dataset is generated every time a map is performed, then for a really big
> dataset it will waste lots of memory and IO. It will also be less efficient when we chain
> several map functions on an RDD, since all of them could be done in one pass.
> Thanks.
> Sincerely,
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --------------------------------------
> Web:
