spark-user mailing list archives

From Daniel Darabos <daniel.dara...@lynxanalytics.com>
Subject Re: How can I create an RDD with millions of entries created programmatically
Date Mon, 08 Dec 2014 20:06:20 GMT
Hi,
I think you have the right idea. I would not even worry about flatMap.

val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
generateRandomObject(x))

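When you evaluate something on this RDD, the work happens partition by
partition. With 1,000,000 elements split into 1000 partitions, each executor
thread generates about 1000 random objects at a time, so the full million
never has to sit in memory at once (as long as you do not cache or collect
the RDD).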
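
For reference, here is a minimal, self-contained sketch of the same idea. The `RandomObject` case class and the body of `generateRandomObject` are hypothetical stand-ins (the real generator is not shown in this thread), and the `local[4]` master is only an assumption for trying it on one machine. As far as I know, `parallelize` slices a `Range` into sub-ranges rather than copying it into a list, so the driver should not need to hold a million integers either.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object GenerateLargeRdd {
  // Hypothetical stand-in for the Java object the real generator produces.
  case class RandomObject(id: Int, payload: Double)

  // Hypothetical generator: seeded by x so that a recomputed partition
  // yields the same objects as the first computation.
  def generateRandomObject(x: Int): RandomObject =
    RandomObject(x, new Random(x).nextDouble())

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 4 threads, purely for experimentation.
    val conf = new SparkConf().setAppName("GenerateLargeRdd").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // 1,000,000 indices in 1000 partitions, so about 1000 per partition.
      val rdd = sc.parallelize(1 to 1000000, numSlices = 1000)
        .map(x => generateRandomObject(x))

      // count() walks each partition's iterator; nothing is cached, so the
      // generated objects can be garbage-collected as the tasks proceed.
      val n = rdd.count()
      println(s"partitions = ${rdd.partitions.length}, objects = $n")
    } finally {
      sc.stop()
    }
  }
}
```

Seeding the generator from the index is a design choice, not a requirement: it keeps the data reproducible if Spark recomputes a partition after a task failure.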

On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:

> I have a function which generates a Java object, and I want to explore
> failures which only happen when processing large numbers of these objects.
> The real code reads a many-gigabyte file, but in the test code I can
> generate similar objects programmatically. I could create a small list,
> parallelize it, and then use flatMap to inflate it several times by a factor
> of 1000 (remember, I can hold a list of 1000 items in memory but not a
> million).
> Are there better ideas? Remember, I want to create more objects than can
> be held in memory at once.
>
>
