spark-dev mailing list archives

From Сергей Лихоман <sergliho...@gmail.com>
Subject Re: Compact RDD representation
Date Sun, 19 Jul 2015 18:09:45 GMT
Thanks for the answer! Could you please answer one more question? Will the
original RDD and the grouped RDD both be held in memory at the same time?

2015-07-19 21:04 GMT+03:00 Sandy Ryza <sandy.ryza@cloudera.com>:

> Edit: the first line should read:
>
>   val groupedRdd = rdd.map((_, 1)).reduceByKey(_ + _)
>
> On Sun, Jul 19, 2015 at 11:02 AM, Sandy Ryza <sandy.ryza@cloudera.com>
> wrote:
>
>> This functionality already basically exists in Spark.  To create the
>> "grouped RDD", one can run:
>>
>>   val groupedRdd = rdd.reduceByKey(_ + _)
>>
>> To get it back into the original form:
>>
>>   groupedRdd.flatMap(x => List.fill(x._2)(x._1))
>>
>> -Sandy
>>
>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <serglihoman@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am looking for a suitable topic for a Master's degree project (something
>>> around scalability problems and improvements for Spark Streaming), and it
>>> seems that introducing a grouped RDD (for example: don't store
>>> "Spark", "Spark", "Spark", instead store ("Spark", 3)) can:
>>>
>>> 1. Reduce the memory needed for an RDD (roughly, memory used will be
>>> proportional to the percentage of unique messages)
>>> 2. Improve performance (no need to apply a function several times to the
>>> same message).
>>>
>>> Can I create a ticket and introduce an API for grouped RDDs? Does it make
>>> sense? I would also very much appreciate criticism and ideas.
>>>
>>
>>
>
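
For reference, here is a minimal sketch of the round trip discussed above. It is written against plain Scala collections rather than an RDD, so it runs without a Spark dependency. The names GroupedRoundTrip, compact and expand are made up for illustration and are not part of any Spark API. On a real RDD, the compaction step would be the rdd.map((_, 1)).reduceByKey(_ + _) call from Sandy's correction.

```scala
// Plain Scala collections stand in for an RDD here (an assumption made
// so the sketch has no Spark dependency).
object GroupedRoundTrip {
  // Compact: keep one (value, count) pair per distinct value.
  def compact[A](xs: Seq[A]): Map[A, Int] =
    xs.groupBy(identity).map { case (value, copies) => (value, copies.size) }

  // Expand: repeat each value `count` times.
  // List.fill takes the repetition count first, then the element.
  def expand[A](grouped: Map[A, Int]): Seq[A] =
    grouped.toSeq.flatMap { case (value, count) => List.fill(count)(value) }

  def main(args: Array[String]): Unit = {
    val messages = Seq("Spark", "Spark", "Spark", "Kafka")
    val grouped  = compact(messages)
    val restored = expand(grouped)
    println(grouped)
    // Element order is not preserved, so compare the sorted sequences.
    println(restored.sorted == messages.sorted)
  }
}
```

Note that the expanded form has the same elements and multiplicities as the original, but not necessarily the same order, so this representation only fits cases where ordering does not matter.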
