spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Does foreach operation increase rdd lineage?
Date Sat, 25 Jan 2014 23:03:29 GMT
Or just checkpoint() it.
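A minimal sketch of that approach (assuming an existing SparkContext named
sc; the checkpoint directory, the update step, and the checkpoint interval
are all illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def runChain(sc: SparkContext, iterations: Int): RDD[Double] = {
      sc.setCheckpointDir("hdfs:///tmp/checkpoints")    // illustrative path
      var state: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0))
      for (i <- 1 to iterations) {
        state = state.map(x => x + 1.0)   // stand-in for one sampling update
        if (i % 50 == 0) {
          state.checkpoint()  // mark the RDD to be saved to the checkpoint dir
          state.count()       // run an action so the checkpoint actually happens
        }
      }
      state                   // once checkpointed, the lineage is truncated
    }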


On Sat, Jan 25, 2014 at 2:40 PM, Jason Lenderman <jslenderman@gmail.com> wrote:

> RDDs are supposed to be immutable. Changing values using foreach seems
> like a bad thing to do, and is going to mess up the probabilities in some
> very difficult-to-understand fashion if you wind up losing a partition of
> your state that needs to be regenerated (a lost partition is recomputed
> from its lineage, without your in-place changes).
>
> Each update of the state of your Markov chain should be a new RDD. I've
> found that I can do this for 100 or 200 iterations and then I'll get a
> stack overflow (presumably because the lineage is growing too large). To
> get around this you either need to occasionally collect the RDD or write
> it to disk.
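A rough sketch of that pattern (illustrative names and update step; assumes
an existing SparkContext named sc and a state small enough to collect back
to the driver):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def advance(sc: SparkContext, init: RDD[Double], iterations: Int): RDD[Double] = {
      var state = init
      for (i <- 1 to iterations) {
        state = state.map(x => x + 1.0)   // each update is a new RDD
        if (i % 100 == 0) {
          // cut the growing lineage by collecting to the driver and
          // re-parallelizing the result as a fresh RDD
          state = sc.parallelize(state.collect())
        }
      }
      state
    }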
>
>
> On Fri, Jan 24, 2014 at 5:22 AM, 尹绪森 <yinxusen@gmail.com> wrote:
>
>> foreach is an action; from the source code you can see that it calls the
>> runJob method. In Spark, it is difficult to change data in place, because
>> it has functional semantics.
>>
>> I think "mapPartitions" is more suitable for machine learning algorithms.
>> I am writing an LDA implementation for MLlib; you can have a look if you
>> like, but it is not deeply optimized yet. I will do more work to optimize it.
>>
>>
>> https://github.com/yinxusen/incubator-spark/blob/lda-mahout/mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala
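A rough sketch of a single sampling pass with mapPartitions (illustrative
only, not taken from the linked code):

    import org.apache.spark.rdd.RDD

    def gibbsPass(state: RDD[Double]): RDD[Double] =
      state.mapPartitions { iter =>
        // per-partition setup (e.g. a random number generator) goes here,
        // so it is created once per partition rather than once per element
        val rng = new scala.util.Random()
        iter.map(x => 0.5 * x + rng.nextGaussian())  // stand-in for resampling
      }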
>>
>>
>>
>> 2014/1/24 guojc <guojc03@gmail.com>
>>
>>> Yes, I mean Gibbs sampling. From the API documentation, I don't see why
>>> the data would be collected to the driver. The documentation says:
>>> 'def foreach(f: (T) => Unit): Unit
>>> Applies a function f to all elements of this RDD.'
>>>
>>> So if I want to change my data in place, what operation should I use?
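As a rough illustration (assuming an existing SparkContext named sc), foreach
just runs the function on the executors for its side effects; it returns Unit
and produces no new RDD:

    sc.parallelize(1 to 10).foreach(x => println(x))
    // on a cluster the println output appears in the executor logs, and any
    // mutation performed inside the function stays on the executors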
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>>
>>> On Fri, Jan 24, 2014 at 9:03 PM, 尹绪森 <yinxusen@gmail.com> wrote:
>>>
>>>> Do you mean "Gibbs sampling"? Actually, foreach is an action; it will
>>>> collect all data from the workers to the driver. You will get an OOM
>>>> error from the JVM.
>>>>
>>>> I am not very sure about your implementation, but if the data does not
>>>> need to be joined together, you'd better keep it on the workers.
>>>>
>>>>
>>>> 2014/1/24 guojc <guojc03@gmail.com>
>>>>
>>>>> Hi,
>>>>>    I'm writing a parallel MCMC program that holds a very large
>>>>> dataset in memory, and I need to update the dataset in memory while
>>>>> avoiding additional copies. Should I use a foreach operation on the
>>>>> RDD to express the change, or do I have to create a new RDD after
>>>>> each sampling process?
>>>>>
>>>>> Thanks,
>>>>> Jiacheng Guo
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>> -----------------------------------
>>>> Xusen Yin    尹绪森
>>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>>> Multimedia
>>>> Beijing University of Posts & Telecommunications
>>>> Intel Labs China
>>>> Homepage: http://yinxusen.github.io/
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>> -----------------------------------
>> Xusen Yin    尹绪森
>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>> Multimedia
>> Beijing University of Posts & Telecommunications
>> Intel Labs China
>> Homepage: http://yinxusen.github.io/
>>
>
>
