spark-user mailing list archives

From 林武康 <vboylin1...@gmail.com>
Subject Re: RDD usage
Date Tue, 25 Mar 2014 01:45:42 GMT
Hi hequn, a related question: does that mean the memory usage will be doubled? Furthermore,
if the compute function of an RDD is not idempotent, the RDD will change while the job is running,
is that right?
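
To illustrate the idempotency concern, here is a minimal sketch that uses a lazy Scala view as a stand-in for an uncached RDD. It is an analogy in plain Scala, not Spark itself, and `counter` and `lazyPoints` are hypothetical names. Spark can recompute uncached or lost partitions from their lineage, so a function with side effects may yield different values each time it is evaluated:

```scala
// Plain-Scala analogy (no Spark needed): a view is re-evaluated on every traversal,
// much like an uncached RDD is recomputed from its lineage.
var counter = 0 // hypothetical shared state that makes the function non-idempotent

val lazyPoints = Seq(1.0, 2.0, 3.0).view.map { y =>
  counter += 1 // side effect: the result depends on how often the function has run
  y + counter
}

val first = lazyPoints.toList  // first evaluation of the mapping function
val second = lazyPoints.toList // second evaluation sees a different counter

println(first)
println(second)
```

The two lists differ because the mapping function ran twice with different state. Calling `cache()` on an RDD reduces recomputation, but cached partitions can still be evicted or lost and then recomputed.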

-----Original Message-----
From: "hequn cheng" <chenghequn@gmail.com>
Sent: 2014/3/25 9:35
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: RDD usage

points.foreach(p=>p.y = another_value) will return a new modified RDD. 
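
For comparison, a minimal sketch of producing a new, modified dataset without mutating the elements. In the Spark API, `foreach` is an action that returns `Unit`, whereas `map` is the transformation that yields a new RDD. The sketch assumes a hypothetical immutable variant of the class (named `Point` here) and uses a plain Scala `Seq` as a stand-in for the RDD so that it runs without a cluster; `anotherValue` is also a placeholder:

```scala
// Hypothetical immutable variant of AppDataPoint: the fields are vals, not vars.
final case class Point(y: Double, x: Array[Double])

// Stand-in for the cached RDD; on a real RDD the same map call would apply.
val points: Seq[Point] = Seq(
  Point(1.0, Array(0.1, 0.2)),
  Point(2.0, Array(0.3, 0.4))
)

val anotherValue = 42.0 // placeholder for another_value in the original message

// map builds new elements and leaves the original ones untouched.
val updated: Seq[Point] = points.map(p => p.copy(y = anotherValue))

println(updated.map(_.y))
println(points.map(_.y))
```

In this plain-Scala sketch, `copy` reuses the same `x` array reference, so only the small wrapper object is duplicated rather than the feature data.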




2014-03-24 18:13 GMT+08:00 Chieh-Yen <r01944006@csie.ntu.edu.tw>:

Dear all,


I have a question about the usage of RDDs.
I implemented a class called AppDataPoint, which looks like this:


case class AppDataPoint(input_y : Double, input_x : Array[Double]) extends Serializable {
  var y : Double = input_y
  var x : Array[Double] = input_x
  ......
}
Furthermore, I create the elements of the RDD with the following function:


def parsePoint(line: String): AppDataPoint = {
  /* Some related works for parsing */
  ......
}


Assume the RDD is called "points":


val lines = sc.textFile(inputPath, numPartition)
var points = lines.map(parsePoint _).cache()


My question: I tried to modify the values in this RDD with the following operation:


points.foreach(p=>p.y = another_value)


The operation works.
The system shows no warning or error message, and the results are correct.
I wonder whether modifying an RDD this way is a correct and workable design.
The usage page says that RDDs are immutable; do you have any suggestions?


Thanks a lot.


Chieh-Yen Lin