The problem I am looking at is as follows:
- I read in a log file of multiple users as an RDD
- I'd like to group the above RDD into multiple RDDs by userIds (the key)
- My processEachUser() function then takes each per-user RDD and calls RDD.map or DataFrame operations on it. (I have already written this function, so I am reluctant to rework it for the ResultIterable object that rdd.groupByKey() returns.)
I've searched the mailing list and googled "RDD of RDDs", and it seems that isn't supported at all.
The remaining choices seem to be: 1) groupByKey() and then work with the ResultIterable object; 2) groupByKey() and then write each group to a file, and read the files back as individual RDDs to process.
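To make the trade-off concrete, here is a rough PySpark sketch of option 1, plus a filter-per-key variant that keeps real per-user RDDs. It assumes the log has already been parsed into an RDD of (userId, record) pairs. The names summarize_user_records, process_grouped and split_by_user are made up for illustration, and summarize_user_records is only a stand-in for what processEachUser() would do, not the real function.

```python
def summarize_user_records(records):
    """Hypothetical stand-in for processEachUser(), rewritten to accept a plain
    iterable (such as the ResultIterable from groupByKey()) instead of an RDD."""
    records = list(records)  # materialize one user's records on a single executor
    return {"count": len(records), "first": min(records)}


def process_grouped(pair_rdd):
    """Option 1: group by userId, then run the per-user function on each group.

    pair_rdd is assumed to be an RDD of (userId, record) pairs.
    Returns an RDD of (userId, summary) pairs.
    """
    return pair_rdd.groupByKey().mapValues(summarize_user_records)


def split_by_user(pair_rdd):
    """Variant: build one genuine RDD per user by filtering on each key.

    This keeps a function that expects an RDD unchanged, but it collects the
    distinct userIds to the driver and scans the data once per user, so it is
    only reasonable for a small number of users (cache pair_rdd first).
    """
    user_ids = pair_rdd.keys().distinct().collect()
    # k=k binds the current userId inside the lambda at definition time
    return {k: pair_rdd.filter(lambda kv, k=k: kv[0] == k).values()
            for k in user_ids}
```

With the first approach, every record of a single user has to fit in memory on one executor. With the second, the existing function could be called on each value of the returned dict, at the price of one Spark job per user.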
Does anyone have a better idea, or has anyone had a similar problem before?
Ph.D. in Management
Dept. of Management Information Systems
University of Arizona
Tucson, AZ 85721