spark-user mailing list archives

From kiran lonikar <loni...@gmail.com>
Subject Re: RDD of RDDs
Date Wed, 10 Jun 2015 02:55:46 GMT
Yes, true. That's why I said "if and when."

But hopefully I have given a correct explanation of why an RDD of RDDs is not
possible.
On 09-Jun-2015 10:22 pm, "Mark Hamstra" <mark@clearstorydata.com> wrote:

> That would constitute a major change in Spark's architecture.  It's not
> happening anytime soon.
>
> On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar <lonikar@gmail.com> wrote:
>
>> Possibly in future, if and when spark architecture allows workers to
>> launch spark jobs (the functions passed to transformation or action APIs of
>> RDD), it will be possible to have RDD of RDD.
>>
>> On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar <lonikar@gmail.com> wrote:
>>
>>> A similar question was asked before:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
>>>
>>> Here is one of the reasons why I think RDD[RDD[T]] is not possible:
>>>
>>>    - An RDD is only a handle to the actual data partitions. It holds a
>>>    reference to the *SparkContext* object (*sc*) and a list of
>>>    partitions.
>>>    - The *SparkContext* is an object in the Spark application/driver
>>>    program's JVM, and the list of partitions likewise lives in the driver's
>>>    JVM. Each partition contains a kind of "remote reference" to the
>>>    partition's data on the worker JVMs.
>>>    - The functions passed to an RDD's transformations and actions execute
>>>    in the workers' JVMs on different nodes. For example, in "*rdd.map {
>>>    x => x*x }*", the function computing "*x*x*" runs on the JVMs of
>>>    the worker nodes where the RDD's partitions reside. These JVMs do not
>>>    have access to "*sc*", since it exists only in the driver's JVM.
>>>    - Thus, in the case of your *RDD of RDDs*, *outerRDD.map { innerRDD =>
>>>    innerRDD.filter { x => x > 0 } }*, the worker nodes cannot execute
>>>    the *filter* on *innerRDD*, because the code running on the worker has
>>>    no access to "sc" and cannot launch a Spark job.
>>>
>>>
>>> Hope this helps. Consider List[RDD] or some other driver-side collection instead.
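The driver-side List-of-RDDs idea suggested above can be sketched as follows. Since no live SparkContext is at hand here, plain Python lists stand in for RDDs; in PySpark each filter below would be a real `rdd.filter(...)`, yielding a list (or dict) of RDDs held on the driver. The user IDs and log records are made-up illustration data.

```python
# Driver-side "list of RDDs" pattern, with plain lists standing in for RDDs.
# In PySpark this would be roughly:
#   keys = logs.keys().distinct().collect()
#   per_user = {k: logs.filter(lambda rec: rec[0] == k) for k in keys}
# The records below are made-up illustration data.

logs = [("u1", "login"), ("u2", "click"), ("u1", "logout"), ("u3", "login")]

# Collect the distinct keys on the driver.
keys = sorted({user_id for user_id, _ in logs})

# One sub-collection per key -- a plain dict built on the driver,
# not a nested RDD. Each value could itself be a real RDD in Spark.
per_user = {k: [rec for rec in logs if rec[0] == k] for k in keys}
```

Each entry of `per_user` can then be processed independently, which is exactly what a nested RDD would have been used for.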
>>>
>>> -Kiran
>>>
>>> On Tue, Jun 9, 2015 at 2:25 AM, ping yan <sharonyan@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> The problem I am looking at is as follows:
>>>>
>>>> - I read in a log file of multiple users as an RDD
>>>>
>>>> - I'd like to group the above RDD into *multiple RDDs* by userId (the
>>>> key)
>>>>
>>>> - my processEachUser() function then takes in each RDD mapped into
>>>> each individual user, and calls for RDD.map or DataFrame operations on
>>>> them. (I already had the function coded, I am therefore reluctant to work
>>>> with the ResultIterable object coming out of rdd.groupByKey() ... )
>>>>
>>>> I've searched the mailing list and googled "RDD of RDDs", and it seems
>>>> it isn't supported at all.
>>>>
>>>> The few remaining choices seem to be: 1) groupByKey() and then work with
>>>> the ResultIterable object; 2) groupByKey(), write each group out to a
>>>> file, and read the files back as individual RDDs to process.
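Choice 1) above can be sketched like this. `itertools.groupby` on sorted data stands in for Spark's `rdd.groupByKey()`, and `process_each_user` is a hypothetical placeholder for the poster's existing `processEachUser()` function; the log records are made-up illustration data.

```python
from itertools import groupby

# Group records by userId, then hand each group's iterable to a per-user
# function. itertools.groupby on sorted input plays the role of Spark's
# rdd.groupByKey(); process_each_user is a hypothetical stand-in for the
# poster's processEachUser().

logs = [("u2", "click"), ("u1", "login"), ("u1", "logout"), ("u3", "login")]

def process_each_user(user_id, records):
    # Placeholder per-user processing: just count the records.
    return user_id, len(list(records))

grouped = groupby(sorted(logs), key=lambda rec: rec[0])
results = dict(process_each_user(uid, recs) for uid, recs in grouped)
```

In real Spark the same shape would be `rdd.groupByKey().map(lambda kv: process_each_user(kv[0], kv[1]))`, with the caveat that the per-user function receives a plain iterable (ResultIterable), not an RDD.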
>>>>
>>>> Anyone got a better idea or had a similar problem before?
>>>>
>>>>
>>>> Thanks!
>>>> Ping
>>>>
>>>>
>>>> --
>>>> Ping Yan
>>>> Ph.D. in Management
>>>> Dept. of Management Information Systems
>>>> University of Arizona
>>>> Tucson, AZ 85721
>>>>
>>>>
>>>
>>
>
