spark-user mailing list archives

From Jason Lenderman <jslender...@gmail.com>
Subject Re: Does foreach operation increase rdd lineage?
Date Sat, 25 Jan 2014 22:40:54 GMT
RDDs are supposed to be immutable. Changing values in place with foreach
seems like a bad idea: if you lose a partition of your state and it has to
be regenerated, the recomputed values will not include your in-place
updates, and that will distort the sampled distribution in ways that are
very hard to diagnose.

Each update of the state of your Markov chain should be a new RDD. I've
found that I can do this for 100 or 200 iterations before I get a stack
overflow (presumably because the lineage grows too long). To get around
this, you need to occasionally either collect the RDD or write it to disk,
so that the lineage starts over from the materialized data.
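
As a rough illustration of the pattern described above, here is a minimal sketch, not a definitive implementation. It assumes a toy chain whose state is an RDD[Double] and whose transition is a Gaussian random-walk step; the names IterativeChain, step, run, and checkpointEvery, the seed formula, and the checkpoint path are all made up for the example. It uses RDD.checkpoint() as one way of "writing it to disk", and seeds each partition's generator from the iteration and partition index so that a regenerated partition repeats the same draws.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random

object IterativeChain {
  // Hypothetical transition: every element takes one Gaussian random-walk step.
  // The seed depends only on (iteration, partition index), so a lost partition
  // that is recomputed from lineage reproduces the same draws.
  def step(state: RDD[Double], iteration: Int): RDD[Double] =
    state.mapPartitionsWithIndex { (partition, values) =>
      val rng = new Random(iteration * 10007L + partition)
      values.map(x => x + rng.nextGaussian())
    }

  // Each iteration builds a new RDD instead of mutating the old one.
  // Every `checkpointEvery` iterations the new state is checkpointed,
  // which cuts its lineage once the checkpoint has been written.
  def run(initial: RDD[Double], iterations: Int, checkpointEvery: Int): RDD[Double] = {
    var state = initial.cache()
    for (i <- 1 to iterations) {
      val next = step(state, i).cache()
      // checkpoint() has to be called before the first action on `next`.
      if (i % checkpointEvery == 0) next.checkpoint()
      next.count()      // action: materializes `next` and writes the checkpoint if marked
      state.unpersist() // the previous state's cached blocks are no longer needed
      state = next
    }
    state
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterative-chain").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory
    try {
      val initial = sc.parallelize(Seq.fill(1000)(0.0), 4)
      val finalState = run(initial, iterations = 300, checkpointEvery = 50)
      println(s"elements: ${finalState.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

How often to checkpoint is a trade-off between the cost of writing the state out and the length of lineage you are willing to carry; the value 50 here is only a placeholder.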


On Fri, Jan 24, 2014 at 5:22 AM, 尹绪森 <yinxusen@gmail.com> wrote:

> foreach is an action; from the source code you can see that it calls the
> runJob method. In Spark it is difficult to change data in place, because
> the API has functional semantics.
>
> I think "mapPartitions" is more suitable for machine learning algorithms.
> I am writing an LDA implementation for MLlib; you can have a look if you
> like, though it is not heavily optimized yet. I will do more work to
> optimize it.
>
>
> https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala
>
>
>
> 2014/1/24 guojc <guojc03@gmail.com>
>
>> Yes, I mean Gibbs sampling. From the API documentation, I don't see why
>> the data would be collected to the driver. The documentation says: '
>> def foreach(f: (T) => Unit): Unit
>> Applies a function f to all elements of this RDD.'
>>
>> So if I want to change my data in place, which operation should I use?
>>
>> Best Regards,
>> Jiacheng Guo
>>
>>
>> On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <yinxusen@gmail.com> wrote:
>>
>>> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
>>> collect all data from the workers to the driver, and the JVM will
>>> complain that it is out of memory.
>>>
>>> I am not sure about your implementation, but if the data does not need
>>> to be joined together, you had better keep it on the workers.
>>>
>>>
>>> 2014/1/24 guojc <guojc03@gmail.com>
>>>
>>>> Hi,
>>>>    I'm writing a parallel MCMC program that holds a very large dataset
>>>> in memory, and I need to update the dataset in memory without creating
>>>> an additional copy. Should I use a foreach operation on the RDD to
>>>> express the change, or do I have to create a new RDD after each
>>>> sampling step?
>>>>
>>>> Thanks,
>>>> Jiacheng Guo
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>> -----------------------------------
>>> Xusen Yin    尹绪森
>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>> Multimedia
>>> Beijing University of Posts & Telecommunications
>>> Intel Labs China
>>> Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*
>>>
>>
>>
>
>
>
