spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kiran lonikar <loni...@gmail.com>
Subject Re: RDD of RDDs
Date Tue, 09 Jun 2015 08:17:06 GMT
Simillar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html

Here is one of the reasons why I think RDD[RDD[T]] is not possible:

   - RDD is only a handle to the actual data partitions. It has a
   reference/pointer to the *SparkContext* object (*sc*) and a list of
   partitions.
   - The *SparkContext *is an object in the Spark Application/Driver
   Program's JVM. Similarly, the list of partitions is also in the JVM of the
   driver program. Each partition contains kind of "remote references" to the
   partition data on the worker JVMs.
   - The functions passed to RDD's transformations and actions execute in
   the worker's JVMs on different nodes. For example, in "*rdd.map { x =>
   x*x }*", the function performing "*x*x*" runs on the JVMs of the worker
   nodes where the partitions of the RDD reside. These JVMs do not have access
   to the "*sc*" since its only on the driver's JVM.
   - Thus, in case of your *RDD of RDD*: *outerRDD.map { innerRdd =>
   innerRDD.filter { x => x*x } }*, the worker nodes will not be able to
   execute the *filter* on *innerRDD *as the code in the worker does not
   have access to "sc" and can not launch a spark job.


Hope it helps. You need to consider List[RDD] or some other collection.

-Kiran

On Tue, Jun 9, 2015 at 2:25 AM, ping yan <sharonyan@gmail.com> wrote:

> Hi,
>
>
> The problem I am looking at is as follows:
>
> - I read in a log file of multiple users as a RDD
>
> - I'd like to group the above RDD into *multiple RDDs* by userIds (the
> key)
>
> - my processEachUser() function then takes in each RDD mapped into
> each individual user, and calls for RDD.map or DataFrame operations on
> them. (I already had the function coded, I am therefore reluctant to work
> with the ResultIterable object coming out of rdd.groupByKey() ... )
>
> I've searched the mailing list and googled on "RDD of RDDs" and seems like
> it isn't a thing at all.
>
> A few choices left seem to be: 1) groupByKey() and then work with the
> ResultIterable object; 2) groupbyKey() and then write each group into a
> file, and read them back as individual rdds to process..
>
> Anyone got a better idea or had a similar problem before?
>
>
> Thanks!
> Ping
>
>
>
>
>
>
> --
> Ping Yan
> Ph.D. in Management
> Dept. of Management Information Systems
> University of Arizona
> Tucson, AZ 85721
>
>

Mime
View raw message