AFAIK Spark assumes infinite disk space, so there isn't really a way to limit how much space it uses. Unfortunately I'm not aware of a simpler workaround than to provision your cluster with more disk space. By the way, are you sure it's disk space that exceeded the limit, and not the number of inodes? If it's the latter, maybe you could control the ulimit of the container.
To answer your other question: if it can't persist to disk then yes, it will fail. It will only recompute from the data source if for some reason your blocks were evicted from memory, but that shouldn't happen in your case since you're using MEMORY_AND_DISK_SER.
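For reference, a minimal sketch of the persist behavior being discussed (Scala; the dataset path and the parsing step are placeholders, and `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical example: cache an RDD, keeping serialized blocks in memory
// and spilling the remainder to the executors' local disks.
val data = sc.textFile("hdfs:///path/to/dataset") // path is a placeholder
val parsed = data
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

// Blocks that don't fit in memory are written under the executor's local
// dirs (on YARN, somewhere under .../usercache/<user>/appcache), which is
// the directory filling up in your case.
parsed.count()
```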
2015-07-10 3:51 GMT-07:00 Peter Rudenko <firstname.lastname@example.org>:
Hi, I have a Spark ML workflow that uses several persist calls. When I launch it with a 1 TB dataset, it brings down the whole cluster because it fills all the disk space at /yarn/nm/usercache/root/appcache: http://i.imgur.com/qvRUrOp.png
I found a YARN setting:
yarn.nodemanager.localizer.cache.target-size-mb - Target size of localizer cache in MB, per NodeManager. It is a target retention size that only includes resources with PUBLIC and PRIVATE visibility and excludes resources with APPLICATION visibility.
But it excludes resources with APPLICATION visibility, and the Spark cache, as I understand it, has APPLICATION visibility.
Is it possible to restrict disk space for a Spark application? Will Spark fail if it isn't able to persist to disk (StorageLevel.MEMORY_AND_DISK_SER), or will it recompute from the data source?