spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: RDD of RDDs
Date Tue, 09 Jun 2015 16:52:19 GMT
That would constitute a major change in Spark's architecture.  It's not
happening anytime soon.

On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar <lonikar@gmail.com> wrote:

> Possibly in the future, if and when Spark's architecture allows workers to
> launch Spark jobs (i.e., to call the transformation or action APIs of an RDD
> from inside the functions passed to them), it will be possible to have an
> RDD of RDDs.
>
> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar <lonikar@gmail.com> wrote:
>
>> A similar question was asked before:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>
>> Here is one of the reasons why I think RDD[RDD[T]] is not possible:
>>
>>    - An RDD is only a handle to the actual data partitions. It has a
>>    reference/pointer to the *SparkContext* object (*sc*) and a list of
>>    partitions.
>>    - The *SparkContext* is an object in the Spark Application/Driver
>>    Program's JVM. Similarly, the list of partitions is also in the JVM of the
>>    driver program. Each partition contains a kind of "remote reference" to the
>>    partition data on the worker JVMs.
>>    - The functions passed to an RDD's transformations and actions execute
>>    in the workers' JVMs on different nodes. For example, in "*rdd.map {
>>    x => x*x }*", the function computing "*x*x*" runs on the JVMs of the
>>    worker nodes where the partitions of the RDD reside. These JVMs do not have
>>    access to "*sc*" since it exists only in the driver's JVM.
>>    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
>>    innerRDD.filter { x => x > 0 } }*, the worker nodes will not be able to
>>    execute the *filter* on *innerRDD*, as the code running in the worker does
>>    not have access to "sc" and cannot launch a Spark job (as sketched below).
>>
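>> To make this concrete, here is a minimal sketch of the failure (the
>> variable names are mine; only the error message is Spark's):
>>
>>     // assuming an existing SparkContext `sc`, e.g. in spark-shell
>>     val inner = sc.parallelize(1 to 10)
>>     val outer = sc.parallelize(Seq(1, 2, 3))
>>
>>     // This compiles, but the closure captures `inner`, and invoking an
>>     // action on an RDD inside another RDD's task fails at runtime with
>>     // "SparkException: RDD transformations and actions can only be
>>     // invoked by the driver, not inside of other transformations"
>>     // (see SPARK-5063).
>>     outer.map { x => inner.filter(_ > x).count() }.collect()
>>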
>>
>> Hope this helps. You could consider List[RDD] or some other driver-side
>> collection instead.
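>>
>> A rough sketch of that driver-side collection approach (the names and
>> data are made up for illustration):
>>
>>     // assuming an existing SparkContext `sc`
>>     val rdds: List[org.apache.spark.rdd.RDD[Int]] =
>>       List(sc.parallelize(1 to 5), sc.parallelize(6 to 10))
>>
>>     // This `map` is Scala's List.map and runs on the driver, so each
>>     // inner transformation/action is still launched from the driver JVM.
>>     val counts = rdds.map(rdd => rdd.filter(_ % 2 == 0).count())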
>>
>> -Kiran
>>
>> On Tue, Jun 9, 2015 at 2:25 AM, ping yan <sharonyan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> The problem I am looking at is as follows:
>>>
>>> - I read in a log file covering multiple users as an RDD
>>>
>>> - I'd like to group the above RDD into *multiple RDDs* by userId (the
>>> key)
>>>
>>> - my processEachUser() function then takes the RDD corresponding to each
>>> individual user and calls RDD.map or DataFrame operations on it. (I
>>> already have the function coded, so I am reluctant to rework it around
>>> the ResultIterable object coming out of rdd.groupByKey() ... )
>>>
>>> I've searched the mailing list and googled "RDD of RDDs", and it seems
>>> like it isn't a thing at all.
>>>
>>> The few choices left seem to be: 1) groupByKey() and then work with the
>>> ResultIterable object; 2) groupByKey() and then write each group out to a
>>> file, and read them back as individual RDDs to process.
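>>>
>>> For concreteness, choice 1 would look roughly like this (a sketch with
>>> made-up data; in PySpark the grouped values arrive as a ResultIterable
>>> rather than an RDD):
>>>
>>>     // assuming an existing SparkContext `sc`
>>>     val logs = sc.parallelize(Seq(
>>>       ("user1", "login"), ("user2", "click"), ("user1", "logout")))
>>>
>>>     // groupByKey() yields (userId, Iterable[String]) pairs; the per-user
>>>     // values are a plain Iterable, not an RDD or DataFrame, which is why
>>>     // an RDD-based processEachUser() would need rewriting.
>>>     val perUser = logs.groupByKey().mapValues(_.size).collect()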
>>>
>>> Anyone got a better idea or had a similar problem before?
>>>
>>>
>>> Thanks!
>>> Ping
>>>
>>> --
>>> Ping Yan
>>> Ph.D. in Management
>>> Dept. of Management Information Systems
>>> University of Arizona
>>> Tucson, AZ 85721
>>>
>>>
>>
>
