Thanks Shixiong!

It's very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime.

Jianshi

On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu <zsxwing@gmail.com> wrote:
There are 2 cases for "No space left on device":

1. Some tasks that use a lot of temp space cannot run on any node.
2. The free space across datanodes is not balanced. Some tasks that use a lot of temp space cannot run on a few nodes, but they can run successfully on other nodes.

Because most of our cases are the second one, we set "spark.scheduler.executorTaskBlacklistTime" to 30000 to work around such "No space left on device" errors: if a task fails on some executor, it won't be scheduled to that executor again within 30 seconds.
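
For example, here is a minimal sketch of setting that property when building the SparkConf (the app name is just a placeholder; the value is in milliseconds):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: after a task fails on an executor, don't re-schedule it on the
    // same executor for 30 seconds.
    val conf = new SparkConf()
      .setAppName("example-app")  // placeholder name
      .set("spark.scheduler.executorTaskBlacklistTime", "30000")
    val sc = new SparkContext(conf)

The same property can also be passed on the command line, e.g. spark-submit --conf spark.scheduler.executorTaskBlacklistTime=30000.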


Best Regards,

Shixiong Zhu

2015-03-16 17:40 GMT+08:00 Jianshi Huang <jianshi.huang@gmail.com>:

On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang <jianshi.huang@gmail.com> wrote:
Hi,

We've been hitting "No space left on device" errors from time to time lately, and the job fails after retries. Obviously, in such cases, retrying won't help.

Sure, the problem is on the datanodes, but I'm wondering if the Spark driver can handle it and decommission the problematic datanode before retrying, and maybe dynamically allocate another node if dynamic allocation is enabled. For reference, these are the settings I mean by dynamic allocation, sketched below.
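
A sketch only, assuming the external shuffle service is available on the cluster; the executor counts are placeholders:

    import org.apache.spark.SparkConf

    // Standard dynamic-allocation settings; numbers are placeholders.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.shuffle.service.enabled", "true")  // required for dynamic allocation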

I think there needs to be a class of fatal errors that can't be recovered from by retrying, and it would be best if Spark could handle them gracefully.

Thanks,
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/





--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/