spark-user mailing list archives

From Silvio Fiorito <>
Subject Re: Spark data frame
Date Tue, 22 Dec 2015 22:20:09 GMT

collect will bring the results down to the driver JVM, whereas the RDD or DataFrame (if cached) lives on the executors. So, as Dean said, the driver JVM needs enough memory to hold the results of collect.


From: Michael Segel <<>>
Date: Tuesday, December 22, 2015 at 4:26 PM
To: Dean Wampler <<>>
Cc: Gaurav Agarwal <<>>, "<>"
Subject: Re: Spark data frame


So you have the RDD in memory and then the collect() result as a collection, where both are alive at the
same time.
(Again not sure how Tungsten plays in to this… )

So his collection can’t be larger than 1/2 of the memory allocated to the heap.

(Unless you have allocated swap…, right?)

On Dec 22, 2015, at 12:11 PM, Dean Wampler <<>>

You can call the collect() method to return a collection, but be careful. If your data is
too big to fit in the driver's memory, it will crash.
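A minimal sketch of the safer alternatives, assuming Spark 1.5/1.6's DataFrame API and an already-built DataFrame `df` (the names here are illustrative; this needs a running SparkContext):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

class CollectSafely {
    static void inspect(DataFrame df) {
        df.cache();                    // cached partitions stay on the executors
        Row[] sample = df.take(10);    // only 10 rows are shipped to the driver
        long n = df.count();           // size check without materializing the data
        // df.collect() would copy all n rows into the driver heap -- only safe
        // when n is known to be small (and the driver heap sized accordingly).
    }
}
```

take(n) and count() let you inspect the data without pulling the full result set across the network into the driver.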

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition<>

On Tue, Dec 22, 2015 at 1:09 PM, Gaurav Agarwal <<>>

We are able to retrieve a DataFrame by filtering the RDD object. I need to convert that DataFrame into a Java POJO. Any idea how to do that?
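One common approach (a sketch, not the only way: it assumes Spark 1.x's DataFrame/JavaRDD API and a hypothetical `Person` POJO whose fields match the frame's schema) is to map each Row into the POJO by position:

```java
import java.io.Serializable;

// Hypothetical POJO matching an assumed schema (name: String, age: int).
// It must be Serializable so Spark can ship it between JVMs.
class Person implements Serializable {
    private String name;
    private int age;
    public Person() {}
    public Person(String name, int age) { this.name = name; this.age = age; }
    public String getName() { return name; }
    public int getAge() { return age; }
}

// With a DataFrame `df` of that schema, the conversion would be (illustrative):
// List<Person> people = df.javaRDD()
//     .map(row -> new Person(row.getString(0), row.getInt(1)))
//     .collect();   // same caveat as above: all rows land in the driver heap
```

The same caution about collect() applies here; mapping to POJOs does not change where the data ends up.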
