spark-user mailing list archives

From Daniel Darabos <daniel.dara...@lynxanalytics.com>
Subject Re: How can I create an RDD with millions of entries created programmatically
Date Tue, 09 Dec 2014 09:07:53 GMT
Ah... I think you're right about the flatMap then :). Or you could use
mapPartitions. (I'm not sure if it makes a difference.)
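
For the Java side, the important part is probably that whatever you hand back from flatMap (or mapPartitions) is lazy, so a partition never holds all of its objects in a list. Below is a minimal plain-Java sketch of such a generator. The class name `LazyGenerator`, the `seed * count` index scheme, and the `generate` callback (a stand-in for your own object-generating function) are all assumptions for illustration, not anything from Spark itself.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongFunction;

/**
 * Lazily yields `count` generated objects for one seed.
 * Objects are created one at a time in next(); nothing is buffered.
 */
final class LazyGenerator<T> implements Iterator<T> {
    private final long start;               // global index of this seed's first object
    private final long count;               // how many objects this seed expands into
    private final LongFunction<T> generate; // hypothetical stand-in for your generator
    private long produced = 0;              // objects handed out so far

    LazyGenerator(long seed, long count, LongFunction<T> generate) {
        this.start = seed * count;
        this.count = count;
        this.generate = generate;
    }

    @Override
    public boolean hasNext() {
        return produced < count;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Build the object only when the consumer asks for it.
        return generate.apply(start + produced++);
    }
}
```

Assuming Spark 2.x or later, where the Java `FlatMapFunction` returns an `Iterator`, this could be wired up roughly as `sc.parallelize(seeds, 1000).flatMap(seed -> new LazyGenerator<>(seed, 1000, i -> generateRandomObject(i)))`, with `seeds` being a small in-memory `List<Integer>` holding 0 to 999. On Spark 1.x the function is expected to return an `Iterable` instead, so the iterator would need a thin wrapper.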

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:

> Looks good, but how do I say that in Java?
> As far as I can see, sc.parallelize (in Java) has only one implementation,
> which takes a List and so requires an in-memory representation.
>
> On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <
> daniel.darabos@lynxanalytics.com> wrote:
>
>> Hi,
>> I think you have the right idea. I would not even worry about flatMap.
>>
>> val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
>> generateRandomObject(x))
>>
>> Then when you try to evaluate something on this RDD, it will happen
>> partition-by-partition. So 1000 random objects will be generated at a time
>> per executor thread.
>>
>> On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis <lordjoe2000@gmail.com>
>> wrote:
>>
>>> I have a function which generates a Java object, and I want to explore
>>> failures which only happen when processing large numbers of these objects.
>>> The real code reads a many-gigabyte file, but in the test code I can
>>> generate similar objects programmatically. I could create a small list,
>>> parallelize it, and then use flatMap to inflate it several times by a factor
>>> of 1000 (remember, I can hold a list of 1000 items in memory but not a
>>> million).
>>> Are there better ideas? Remember, I want to create more objects than can
>>> be held in memory at once.
>>>
>>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>
