foreach is an action; from the source code you can see that it calls the runJob method. In Spark it is difficult to change data in place, because RDDs are immutable and the API follows functional semantics.

I think "mapPartitions" is more suitable for machine learning algorithms. I am writing an LDA implementation for MLlib; you can have a look if you like, but it is not deeply optimized yet. I will do more work to optimize it.

https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala
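
A minimal sketch of the mapPartitions idea, not taken from the linked GibbsSampling.scala: each sampling sweep builds a new RDD partition by partition, and the previous RDD is unpersisted once the new one is cached, so only one extra copy exists at a time. The names Token, resamplePartition and sweep are hypothetical, and the uniform random draw is only a placeholder for a real conditional sampling step. It uses mapPartitionsWithIndex, the variant of mapPartitions that also passes the partition index, so that each partition gets its own seed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random

object GibbsSweepSketch {
  // Hypothetical record: one token and its current topic assignment.
  case class Token(wordId: Int, topic: Int)

  // Pure per-partition step. The uniform draw is a placeholder (assumption)
  // for the real conditional distribution of a Gibbs sampler.
  def resamplePartition(iter: Iterator[Token], numTopics: Int, rng: Random): Iterator[Token] =
    iter.map(t => t.copy(topic = rng.nextInt(numTopics)))

  // One sweep: returns a NEW RDD; the input RDD is not modified.
  def sweep(tokens: RDD[Token], numTopics: Int, seed: Long): RDD[Token] =
    tokens.mapPartitionsWithIndex { (partitionId, iter) =>
      // One generator per partition, seeded differently for each partition.
      resamplePartition(iter, numTopics, new Random(seed + partitionId))
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("gibbs-sweep-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      var current: RDD[Token] =
        sc.parallelize(0 until 1000, 4).map(i => Token(i % 50, 0)).cache()
      current.count() // materialize the initial state

      for (i <- 1 to 3) {
        val next = sweep(current, numTopics = 10, seed = i.toLong).cache()
        next.count()        // materialize the new state before dropping the old one
        current.unpersist() // release the previous copy
        current = next
      }
      println(current.count())
    } finally {
      sc.stop()
    }
  }
}
```

The data stays on the workers throughout; only the counts travel back to the driver.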



2014/1/24 guojc <guojc03@gmail.com>
Yes, I mean Gibbs sampling. From the API documentation, I don't see why the data would be collected to the driver. The documentation says:
'def foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of this RDD.'

So if I want to change my data in place, which operation should I use?

Best Regards,
Jiacheng Guo


On Fri, Jan 24, 2014 at 9:03 PM, Xusen Yin <yinxusen@gmail.com> wrote:
Do you mean "Gibbs sampling"? Actually, foreach is an action; it will collect all data from the workers to the driver, and the JVM will complain with an OOM error.

I am not sure about your implementation, but if the data does not need to be joined together, you had better keep it on the workers.


2014/1/24 guojc <guojc03@gmail.com>
Hi,
I'm writing a parallel MCMC program that holds a very large dataset in memory, and I need to update the dataset in memory without creating an additional copy. Should I use a foreach operation on the RDD to express the change, or do I have to create a new RDD after each sampling process?

Thanks,
Jiacheng Guo



--
Best Regards
-----------------------------------
Xusen Yin
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/