spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From 林武康 <>
Subject 答复: 答复: RDD usage
Date Tue, 25 Mar 2014 03:11:26 GMT
Hi hequn, I dig into the source of spark a bit deeper, and I got some ideas, firstly, immutable
is a feather of rdd but not a solid rule, there are ways to change it, for excample, a rdd
with non-idempotent "compute" function, though it is really a bad design to make that function
non-idempotent for uncontrollable side-effect. I agree with Mark that foreach can modify the
elements of a rdd, but we should avoid this because it will effect all the rdds generate by
this changed rdd , make the whole process inconsistent and unstable.

Some rough opinions on the immutable feature of rdd, full discuss can make it more clear.
Any ideas?

发件人: "hequn cheng" <>
发送时间: ‎2014/‎3/‎25 10:40
收件人: "" <>
主题: Re: 答复: RDD usage

First question:
If you save your modified RDD like this:
points.foreach(p=>p.y = another_value).collect() or 
points.foreach(p=>p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any work's memory.
If you have more transformatins after the map(), the spark will pipelines all transformations
and build a DAG. Very little memory will be used in this stage and the memory will be free
Only cache() will persist your RDD in memory for a long time.
Second question:
Once RDD be created, it can not be changed due to the immutable feature.You can only create
a new RDD from the existing RDD or from file system.

2014-03-25 9:45 GMT+08:00 林武康 <>:

Hi hequn, a relative question, is that mean the memory usage will doubled? And further more,
if the compute function in a rdd is not idempotent, rdd will changed during the job running,
is that right? 

发件人: hequn cheng
发送时间: 2014/3/25 9:35
主题: Re: RDD usage

points.foreach(p=>p.y = another_value) will return a new modified RDD. 

2014-03-24 18:13 GMT+08:00 Chieh-Yen <>:

Dear all,

I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:

case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable {
  var y : Double = input_y
  var x : Array[Double] = input_x
Furthermore, I created the RDD by the following function.

def parsePoint(line: String): AppDataPoint = {
  /* Some related works for parsing */

Assume the RDD called "points":

val lines = sc.textFile(inputPath, numPartition)
var points = _).cache()

The question is that, I tried to modify the value of this RDD, the operation is:

points.foreach(p=>p.y = another_value)

The operation is workable.
There doesn't have any warning or error message showed by the system and the results are right.
I wonder that if the modification for RDD is a correct and in fact workable design.
The usage web said that the RDD is immutable, is there any suggestion?

Thanks a lot.

Chieh-Yen Lin
View raw message