spark-issues mailing list archives

From "Mridul Muralidharan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17064) Reconsider spark.job.interruptOnCancel
Date Tue, 04 Oct 2016 23:44:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547073#comment-15547073 ]

Mridul Muralidharan commented on SPARK-17064:
---------------------------------------------


I agree, interrupting a thread can (and usually does) have unintended side effects.
On the plus side, our current model does mean that any actual processing of a tuple will
cause the task to get killed ...
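
(For context: the cooperative model referred to here is the InterruptibleIterator pattern -
the iterator wrapped around a task's input checks the kill flag on every record. A minimal
sketch, close in spirit to o.a.s.InterruptibleIterator of this era but not verbatim source:)

    import org.apache.spark.{TaskContext, TaskKilledException}

    // Wrap a task's input iterator so each hasNext call checks whether the
    // task has been marked killed (cooperative, no Thread.interrupt needed).
    class InterruptibleIterator[+T](context: TaskContext, delegate: Iterator[T])
        extends Iterator[T] {

      override def hasNext: Boolean = {
        // isInterrupted is set when the task is killed; throwing here ends
        // the task at the next tuple boundary.
        if (context.isInterrupted()) {
          throw new TaskKilledException
        }
        delegate.hasNext
      }

      override def next(): T = delegate.next()
    }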

Perhaps we can check this state in the shuffle fetch as well, to further avoid expensive
prefix operations (data fetch, etc.)?
This will of course not catch the case of user code doing something extremely long-running
and expensive while processing a single tuple - but then there is no guarantee that thread
interruption would work in that case anyway.
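
(To make that concrete - a purely hypothetical sketch, not existing Spark code;
ShuffleFetchSketch and fetchNextBatch are illustrative names - the shuffle-read path would
consult the same flag before issuing remote fetches, so a killed task never pays for the
data transfer at all:)

    import org.apache.spark.{TaskContext, TaskKilledException}

    object ShuffleFetchSketch {
      // Hypothetical: check the kill flag before an expensive remote shuffle
      // fetch, not just between tuples as InterruptibleIterator does.
      def fetchNextBatch(context: TaskContext): Unit = {
        if (context.isInterrupted()) {
          throw new TaskKilledException // canceled: skip the fetch entirely
        }
        // ... issue the next round of remote block fetch requests ...
      }
    }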

> Reconsider spark.job.interruptOnCancel
> --------------------------------------
>
>                 Key: SPARK-17064
>                 URL: https://issues.apache.org/jira/browse/SPARK-17064
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Mark Hamstra
>
> There is a frequent need or desire in Spark to cancel already running Tasks. This has
> been recognized for a very long time (see, e.g., the ancient TODO comment in the DAGScheduler:
> "Cancel running tasks in the stage"), but we've never had more than an incomplete solution.
> Killing running Tasks at the Executor level has been implemented by interrupting the threads
> running the Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill). Since
> https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 (addressing
> https://issues.apache.org/jira/browse/SPARK-1582), interrupting Task threads in this way
> has only been possible if interruptThread is true; that value typically comes from the
> interruptOnCancel property of the JobGroup, which in turn typically comes from the setting
> of spark.job.interruptOnCancel. Because of concerns over
> https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes being marked
> dead when a Task thread is interrupted, the default value of the boolean has been "false"
> -- i.e., by default we do not interrupt Tasks already running on an Executor even when the
> Task has been canceled in the DAGScheduler, the Stage has been aborted, or the Job has
> been killed.
> There are several issues resulting from this current state of affairs, and each of them
> probably needs to spawn its own JIRA issue and PR once we decide on an overall strategy
> here. Among those issues:
> * Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS versions
> that Spark now supports, so that we can set the default value of spark.job.interruptOnCancel
> to "true" or eliminate the boolean flag entirely?
> * Even if interrupting Task threads is no longer an issue for HDFS, is it still enough of
> an issue for non-HDFS usage (e.g., Cassandra) that we still need protection similar to what
> the current default value of spark.job.interruptOnCancel provides?
> * If interrupting Task threads isn't safe enough, what should we do instead?
> * Once we have a safe mechanism to stop and clean up after already executing Tasks, there
> is still the question of whether we _should_ stop them. While that is likely a good thing
> to do in cases where individual Tasks are lightweight in terms of resource usage, at least
> in some cases not all running Tasks should be ended (see
> https://github.com/apache/spark/pull/12436). That means we probably need to continue to
> make Task interruption configurable at the Job or JobGroup level (and we need better
> documentation explaining how and when to allow interruption or not).
> * There is one place in the current code (TaskSetManager#handleSuccessfulTask) that
> hard-codes interruptThread to "true". This should be fixed, and similar misuses of killTask
> should be rejected in pull requests until this issue is adequately resolved.
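
For reference, the interruptThread gating described above has roughly this shape (a
simplified sketch of o.a.s.scheduler.Task#kill from this era, not verbatim source; _killed,
context, and taskThread are Task's internal state):

    // The kill flag is always set, but the actual Thread.interrupt() only
    // happens when interruptThread (spark.job.interruptOnCancel) is true.
    def kill(interruptThread: Boolean): Unit = {
      _killed = true                               // cooperative flag, always set
      if (context != null) {
        context.markInterrupted()                  // surfaces as TaskContext.isInterrupted
      }
      if (interruptThread && taskThread != null) {
        taskThread.interrupt()                     // only when explicitly opted in
      }
    }

The interruptThread value ultimately originates from the application opting in per job
group, e.g. sc.setJobGroup("myGroup", "long-running job", interruptOnCancel = true).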



