spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Cheah <>
Subject Re: Visitor function to RDD elements
Date Tue, 22 Oct 2013 22:48:40 GMT
Thanks everyone – I think we're going to go with collect() and kick out things that attempt
to obtain overly large sets.

However, I think my original concern still stands. Some reading online shows that Microsoft
Excel, for example, supports displaying something on the order of 2-4 GB sized spreadsheets
If there is a 2GB RDD however streaming it all back to the driver seems wasteful where in
reality we could fetch chunks of it at a time and load only parts in driver memory, as opposed
to using 2GB of RAM on the driver. In fact I don't know what the maximum frame size that can
be set would be via spark.akka.framesize.

-Matt Cheah

From: Mark Hamstra <<>>
Reply-To: "<>"
Date: Tuesday, October 22, 2013 3:32 PM
To: user <<>>
Subject: Re: Visitor function to RDD elements

Correct; that's the completely degenerate case where you can't do anything in parallel.  Often
you'll also want your iterator function to send back some information to an accumulator (perhaps
just the result calculated with the last element of the partition) which is then fed back
into the operation on the next partition as either a broadcast variable or part of the closure.

On Tue, Oct 22, 2013 at 3:25 PM, Nathan Kronenfeld <<>>
You shouldn't have to fly data around

You can just run it first on partition 0, then on partition 1, etc...  I may have the name
slightly off, but something approximately like:

for (p <- 0 until numPartitions)
  data.mapPartitionsWithIndex((i, iter) => if (0 == p) else List().iterator)

should work... BUT that being said, you've now really lost the point of using Spark to begin

View raw message