spark-user mailing list archives

From guojc <guoj...@gmail.com>
Subject Re: Does foreach operation increase rdd lineage?
Date Fri, 24 Jan 2014 13:08:37 GMT
Yes, I mean Gibbs sampling. From the API documentation, I don't see why the
data would be collected to the driver. The documentation says:
'def foreach(f: (T) => Unit): Unit
Applies a function f to all elements of this RDD.'

So if I want to change my data in place, which operation should I use?
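
The two options in question can be sketched roughly as follows. This is only
an illustration under stated assumptions: it uses a local master, and
sampleStep is a hypothetical placeholder for one sampling update, not the
actual program.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachVsMap {
  // Hypothetical per-element update standing in for one sampling step.
  def sampleStep(x: Double): Double = x * 0.5 + 1.0

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for illustration only.
    val conf = new SparkConf().setAppName("foreach-vs-map").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()

      // Option 1: foreach is an action that returns Unit.
      // The function runs for its side effects only; its result is discarded
      // and no new RDD is produced.
      data.foreach(x => sampleStep(x))

      // Option 2: map is a transformation that returns a new RDD
      // whose lineage points back to `data`.
      val updated = data.map(sampleStep)
      println(updated.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Going by the signatures alone, foreach returns Unit, so it cannot extend the
lineage, while each map adds one step to it.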

Best Regards,
Jiacheng Guo


On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <yinxusen@gmail.com> wrote:

> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
> collect all data from the workers to the driver, and the JVM will complain
> with an OOM error.
>
> I am not sure about your implementation, but if the data does not need to
> be joined together, you had better keep it on the workers.
>
>
> 2014/1/24 guojc <guojc03@gmail.com>
>
>> Hi,
>>    I'm writing a parallel MCMC program that keeps a very large dataset
>> in memory and needs to update that dataset in memory without creating an
>> additional copy. Should I use a foreach operation on the RDD to express the
>> change, or do I have to create a new RDD after each sampling step?
>>
>> Thanks,
>> Jiacheng Guo
>>
>
>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
>
