spark-user mailing list archives

From "octavian.ganea" <>
Subject flatMap output on disk / flatMap memory overhead
Date Mon, 01 Jun 2015 17:32:17 GMT

Is there any way to force the output RDD of a flatMap operation to be stored in
both memory and on disk as it is computed? My RAM cannot fit the entire output
of the flatMap, so it really needs to start spilling to disk once RAM fills up.
I haven't found any way to force this.

Also, what is the memory overhead of flatMap? By my calculations, the output
RDD should fit in memory, but after a while I get the following error (and I
know it's a memory issue, since running the program on 1/3 of the input data
finishes successfully):

15/06/01 19:02:49 ERROR BlockFetcherIterator$BasicBlockFetcherIterator:
Could not get block(s) from
ConnectionManagerId(,57478) sendMessageReliably failed because ack was not received
within 60 sec
	at scala.Option.foreach(Option.scala:236)
	at io.netty.util.HashedWheelTimer$

I've also seen this:
but my understanding is that applying something like
rdd.flatMap(...).persist(MEMORY_AND_DISK) assumes the entire output of the
flatMap is first stored in memory (which is not possible in my case) and is
written to disk only once that is done. Please correct me if I'm wrong.
Anyway, I tried this, but got the same error.

My config:

    conf.set("spark.cores.max", "128")
    conf.set("spark.akka.frameSize", "1024")
    conf.set("spark.executor.memory", "125g")
    conf.set("spark.shuffle.file.buffer.kb", "1000")
    conf.set("spark.shuffle.consolidateFiles", "true")
