spark-user mailing list archives

From ping yan
Subject RDD of RDDs
Date Mon, 08 Jun 2015 20:55:53 GMT

The problem I am looking at is as follows:

- I read in a log file of multiple users as an RDD

- I'd like to group the above RDD into *multiple RDDs* by userIds (the key)

- my processEachUser() function then takes in the RDD for each
individual user and calls RDD or DataFrame operations on it. (I already
have the function coded, so I am reluctant to work with the
ResultIterable object coming out of rdd.groupByKey() ... )

I've searched the mailing list and googled "RDD of RDDs", and it seems like
it isn't a thing at all.

The few choices left seem to be: 1) groupByKey() and then work with the
ResultIterable object; 2) groupByKey() and then write each group into a
file, and read them back as individual RDDs to process.
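
To make choice 1 concrete, here is a rough sketch of what it might look like in PySpark. It is not my actual code: the tab-separated log format and the names parse_line, summarize_user and summarize_all_users are all made up for illustration, and summarize_user is only a stand-in for processEachUser(), rewritten to take a plain iterable of one user's records instead of an RDD.

```python
from collections import Counter


def parse_line(line):
    """Split one log line into (user_id, event).

    Assumes a hypothetical 'userId<TAB>event' line format.
    """
    user_id, event = line.rstrip("\n").split("\t", 1)
    return user_id, event


def summarize_user(events):
    """Hypothetical stand-in for processEachUser().

    Accepts any iterable of one user's events, e.g. a list or the
    ResultIterable that groupByKey() produces for each key.
    """
    counts = Counter(events)
    return {
        "n_events": sum(counts.values()),
        "top_event": counts.most_common(1)[0][0],
    }


def summarize_all_users(sc, path):
    """Choice 1: group by user ID, then process each group on the executors.

    `sc` is assumed to be an existing SparkContext and `path` a log file.
    """
    return (
        sc.textFile(path)            # RDD of raw log lines
        .map(parse_line)             # RDD of (user_id, event) pairs
        .groupByKey()                # RDD of (user_id, ResultIterable)
        .mapValues(summarize_user)   # RDD of (user_id, summary dict)
    )
```

With a live SparkContext, summarize_all_users(sc, "logs.tsv").collect() would return one (user_id, summary) pair per user. The catch is that the per-user function only sees plain Python values, so any RDD or DataFrame calls inside it would have to be rewritten.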

Anyone got a better idea or had a similar problem before?


Ping Yan
Ph.D. in Management
Dept. of Management Information Systems
University of Arizona
Tucson, AZ 85721
