Ah... I think you're right about the flatMap then :). Or you could use mapPartitions. (I'm not sure if it makes a difference.)
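
For the Java side, here is a rough sketch of the flatMap idea using only the JDK, so it runs without Spark. It shows an Iterable that creates each object only when it is asked for, so the 1000 objects behind one seed never sit in memory together. LazyExpansion, expand and the String-returning generateRandomObject stand-in are made-up names for illustration. In Spark you would parallelize the small seed list and return something like this from the function passed to flatMap (as far as I know, the Java API expects an Iterable in 1.x and an Iterator in later versions, so treat that part as an assumption).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Spark: expands one seed into many objects lazily.
class LazyExpansion {

    // Stand-in for the real generator mentioned in the thread.
    static String generateRandomObject(long id) {
        return "object-" + id;
    }

    // Returns an Iterable whose iterator builds each object only when next() is called.
    static Iterable<String> expand(final int seed, final int factor) {
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    private int produced = 0;

                    @Override
                    public boolean hasNext() {
                        return produced < factor;
                    }

                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        // Give every generated object a distinct id across seeds.
                        long id = (long) seed * factor + produced;
                        produced++;
                        return generateRandomObject(id);
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    public static void main(String[] args) {
        // The small in-memory list that would be handed to parallelize.
        List<Integer> seeds = Arrays.asList(0, 1, 2);
        long count = 0;
        for (int seed : seeds) {
            for (String obj : expand(seed, 1000)) {
                count++;
            }
        }
        System.out.println(count); // prints 3000
    }
}
```

Only the seed list has to fit in memory; everything else is produced while it is being consumed.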

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:
Looks good, but how do I say that in Java?
As far as I can see, sc.parallelize (in Java) has only one implementation, which takes a List and so requires an in-memory representation.

On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <daniel.darabos@lynxanalytics.com> wrote:
Hi,
I think you have the right idea. I would not even worry about flatMap.

val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x => generateRandomObject(x))

Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be generated at a time per executor thread.
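
To make the numbers concrete, here is a minimal sketch of the arithmetic in Java, assuming the elements are split evenly across the slices (the exact boundaries Spark picks are an assumption here). SliceSizes and bounds are made-up names.

```java
// Hypothetical illustration, not Spark code: even split of n elements into slices.
class SliceSizes {

    // Returns {start, end} for slice i, covering the half-open range [start, end).
    static long[] bounds(long n, int slices, int i) {
        long start = i * n / slices;
        long end = (i + 1) * n / slices;
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        long[] first = bounds(1000000L, 1000, 0);
        System.out.println(first[1] - first[0]); // prints 1000
    }
}
```

With 1,000,000 elements and 1000 slices, every slice comes out at 1000 elements, which is where the "1000 at a time" figure comes from.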

On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:
I have a function which generates a Java object, and I want to explore failures which only happen when processing large numbers of these objects. The real code reads a multi-gigabyte file, but in the test code I can generate similar objects programmatically. I could create a small list, parallelize it, and then use flatMap to inflate it several times by a factor of 1000 (remember, I can hold a list of 1000 items in memory but not a million).
Are there better ideas? Remember, I want to create more objects than can be held in memory at once.

--
Steven M. Lewis PhD