spark-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Handling fatal errors of executors and decommission datanodes
Date Mon, 16 Mar 2015 09:36:38 GMT
Hi,

We've been hitting "No space left on device" errors from time to time lately. The
job fails after the retries are exhausted; obviously, in such cases retrying won't help.
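
For context, the retry limit we keep running into is spark.task.maxFailures
(4 by default). A minimal sketch of how it's set (the app name and local master
are made up for illustration); raising the limit only postpones the failure
when the disk is genuinely full:

    import org.apache.spark.{SparkConf, SparkContext}

    // Each failed task is retried up to spark.task.maxFailures times; when
    // the cause is a full disk, a retry can land on the same bad node and
    // the whole job is eventually aborted.
    val conf = new SparkConf()
      .setMaster("local[2]")               // just for a local test
      .setAppName("retry-demo")            // hypothetical app name
      .set("spark.task.maxFailures", "4")  // the default; raising it only delays the abort
    val sc = new SparkContext(conf)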

Sure, the problem is in the datanodes, but I'm wondering whether the Spark driver
could handle it and decommission the problematic datanode before retrying, and
maybe dynamically allocate a replacement node when dynamic allocation is
enabled (see the sketch below).
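
To make the idea concrete, here are the kinds of settings I'm imagining, added
to the SparkConf from the sketch above before the SparkContext is created. The
blacklist-style property is hypothetical (the name is made up to illustrate the
proposal); the dynamic-allocation settings are real ones that exist today:

    // Hypothetical knob (property name made up for illustration): stop
    // scheduling tasks on a node once it reports a fatal, non-retryable
    // error such as ENOSPC.
    conf.set("spark.scheduler.blacklistFatalNodes", "true")

    // Real settings: with dynamic allocation on, the driver could then
    // request replacement executors elsewhere in the cluster.
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.shuffle.service.enabled", "true") // required by dynamic allocation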

I think there needs to be a class of fatal errors that can't be recovered from
by retrying, and it would be best if Spark handled them gracefully.

Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
