spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Lewis <>
Subject How can I force operations to complete and spool to disk
Date Thu, 07 May 2015 05:16:54 GMT
I am performing a job where I perform a number of steps in succession.
One step is a map on a JavaRDD which generates objects taking up
significant memory.
The this is followed by a join and an aggregateByKey.
The problem is that the system is running getting OutOfMemoryErrors -
Most tasks work but a few fail. Tasks typically preform both operations
several hundred thousand times.
I am convinced things would work if the map ran to completion and shuffled
results to disk before starting the aggregateByKey.
I tried calling persist and then count on the results of the map to force
execution but this does not seem to help. Smaller partitions might also
help if these could be forced.
Any ideas?

View raw message