spark-dev mailing list archives

From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Re: Compact RDD representation
Date Mon, 20 Jul 2015 08:31:37 GMT
Hi,

I'm not an authority in the Spark community, but what I would do is add the
project to Spark Packages http://spark-packages.org/. In fact, I think this
case is similar to IndexedRDD, which is also hosted on Spark Packages:
http://spark-packages.org/package/amplab/spark-indexedrdd

2015-07-19 21:49 GMT+02:00 Сергей Лихоман <serglihoman@gmail.com>:

> Hi Juan,
>
> That's exactly what I meant. If the workload has many repetitions, this
> can significantly reduce RDD size and improve performance. In real use
> cases, applications frequently need to enrich data from a cache or an
> external system, so we would save that cost on every repeated element.
> I will also run some experiments. About inputs with few repetitions: in
> which use cases would we lose efficiency? I will test that as well.
> What do I need to do to contribute this? Just create a ticket in JIRA?
>
>
>
> 2015-07-19 21:56 GMT+03:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hortala@gmail.com>:
>
>> Hi,
>>
>> My two cents: this could be interesting if all RDD and pair RDD
>> operations were lifted to work on grouped RDDs. For example, as
>> suggested, a map on a grouped RDD would be more efficient if the
>> original RDD had lots of duplicate entries, but for RDDs with few
>> repetitions I guess you would in fact lose efficiency. The same applies
>> to filter, sortBy, count, max, ..., though for example I guess there is
>> no gain for reduce and some other operations. Also note that element
>> order is lost when converting to a grouped RDD, so the semantics are not
>> exactly the same, but they would be good enough for many applications.
>> I would also look for use cases where RDDs with many repetitions arise
>> naturally and where the transformations that benefit, like map, are used
>> often, and I would run some experiments comparing the performance of a
>> computation on a grouped RDD against the same computation without
>> grouping, for different input sizes.
>>
>>
>> On Sunday, July 19, 2015, Sandy Ryza <sandy.ryza@cloudera.com>
>> wrote:
>>
>>> This functionality already basically exists in Spark.  To create the
>>> "grouped RDD" from an RDD of elements, one can run:
>>>
>>>   val groupedRdd = rdd.map(x => (x, 1)).reduceByKey(_ + _)
>>>
>>> To get it back into the original form (up to ordering):
>>>
>>>   groupedRdd.flatMap { case (elem, count) => List.fill(count)(elem) }
>>>
>>> -Sandy
>>>
>>> On Sun, Jul 19, 2015 at 10:40 AM, Сергей Лихоман <serglihoman@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for a suitable topic for my Master's degree project
>>>> (something like scalability problems and improvements for Spark
>>>> Streaming), and it seems that introducing a grouped RDD (for example:
>>>> instead of storing "Spark", "Spark", "Spark", store ("Spark", 3)) can:
>>>>
>>>> 1. Reduce the memory needed for the RDD (roughly, memory used becomes
>>>> proportional to the number of unique messages).
>>>> 2. Improve performance (no need to apply a function several times to
>>>> the same message).
>>>>
>>>> May I create a ticket and propose an API for grouped RDDs? Does that
>>>> make sense? I would also very much appreciate criticism and ideas.
>>>>
>>>
>>>
>

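The round trip Sandy describes, and the lifted map Juan discusses, can be
sketched with plain Scala collections standing in for RDDs. The names
`compact`, `mapCompact`, and `expand` below are illustrative, not Spark API;
`groupBy`/`map` here play the role of `map(x => (x, 1)).reduceByKey(_ + _)`.

```scala
// Hypothetical sketch of the grouped-RDD idea using plain Scala collections.
// compact:    List("Spark","Spark","Spark") -> Map("Spark" -> 3)
// mapCompact: applies f once per distinct element, keeping counts
// expand:     inverse of compact, up to ordering

def compact[A](xs: List[A]): Map[A, Int] =
  xs.groupBy(identity).map { case (k, v) => (k, v.size) }

def mapCompact[A, B](m: Map[A, Int])(f: A => B): List[(B, Int)] =
  m.toList.map { case (k, n) => (f(k), n) }  // f runs once per unique element

def expand[B](pairs: List[(B, Int)]): List[B] =
  pairs.flatMap { case (v, n) => List.fill(n)(v) }

val data   = List("Spark", "Spark", "Spark", "Flink")
val lifted = expand(mapCompact(compact(data))(_.length)).sorted
val plain  = data.map(_.length).sorted
assert(lifted == plain)  // same multiset of results; f was applied only twice
```

As Juan notes, the equivalence holds only up to ordering, which is why both
sides are sorted before comparison; and the win disappears when most elements
are distinct, since `compact` then just adds a count of 1 to every element.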