spark-user mailing list archives

From Thomas Gerber <thomas.ger...@radius.com>
Subject Possible Reasons for BlockNotFoundException
Date Tue, 27 Jan 2015 22:25:45 GMT
Hello,

We run large jobs on a 50-machine cluster on Amazon with Spark Standalone
(NOT YARN). It works great.

Every once in a while, I see a few tasks (fewer than 1 in 1000) that fail
with BlockNotFoundException. Those are retried successfully.

In the example attached, we have a stage that takes a cached (memory
deserialized) RDD and applies some map, filter and flatMap transformations to
it (no shuffle). You will see the 5 failures and their successfully retried
counterparts.
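
For context, here is a minimal sketch of the shape of that stage (the input
path, the names and the lambdas are made-up placeholders, not our actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("block-not-found-example")
val sc = new SparkContext(conf)

// Cache the source RDD deserialized in memory ("memory deserialized" level).
val events = sc.textFile("hdfs:///data/events")
val cached = events.persist(StorageLevel.MEMORY_ONLY)

// Narrow, shuffle-free stage over the cached RDD: map, filter, flatMap.
val result = cached
  .map(_.toLowerCase)
  .filter(_.nonEmpty)
  .flatMap(_.split("\\s+"))

result.count()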

My question is pretty broad, but I wonder what the cause(s) of those
failures could be. Are they usually caused by network blips? Or could they be
caused by caching?

I ask because the original cached RDD is probably not entirely in memory. I
can tell because in the list of tasks for that stage, the Input column shows
tasks reading from "network" and "hadoop", not just "memory". Could those
failed tasks happen when a cached block has just been evicted from memory?
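
If eviction does turn out to be the culprit, I assume something along these
lines would sidestep it (reusing the names from the sketch above;
MEMORY_AND_DISK spills evicted blocks to local disk instead of dropping them),
and the storage info call below is how I have been checking how much of the
RDD is actually cached:

import org.apache.spark.storage.StorageLevel

// Spill blocks that don't fit in memory to local disk rather than dropping them.
val cached = events.persist(StorageLevel.MEMORY_AND_DISK)

// Same numbers as the Storage tab of the UI: how many partitions are cached
// and how many bytes they occupy in memory.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.memSize} bytes in memory")
}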

Thanks!
Thomas

PS: I would also be interested in what "network" means as an input; is it
cached data read from another worker node's memory? I notice that tasks
reading from "network" happen towards the end of the stage; maybe those are
executors that had no more PROCESS_LOCAL tasks left to run?
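
(If those end-of-stage "network" reads are indeed locality fallbacks, the knob
I would experiment with is spark.locality.wait; the value below is only an
example, not a recommendation.)

import org.apache.spark.SparkConf

// Wait longer for a PROCESS_LOCAL/NODE_LOCAL slot before scheduling a task
// at a worse locality level; value in milliseconds, default is 3000.
val conf = new SparkConf()
  .setAppName("locality-wait-experiment")
  .set("spark.locality.wait", "10000")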
