spark-issues mailing list archives

From "Shixiong Zhu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly
Date Thu, 16 Feb 2017 19:05:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-19617:
---------------------------------
    Description: 
The streaming thread in StreamExecution uses the following checks to decide whether it should exit:
- Catch an `InterruptedException`.
- `StreamExecution.state` is `TERMINATED`.

When starting and stopping a query quickly, both checks may fail:
- Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084], which swallows the `InterruptedException`.
- `StreamExecution.stop` is called before `state` becomes `ACTIVE`; then [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] overwrites the state from `TERMINATED` back to `ACTIVE`.

If both of the above happen, the query hangs forever.
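
A minimal, hypothetical sketch of the race (this is not Spark's actual code; the object name `StartStopRace`, the helper `swallowingWork`, and the simplified state machine are all illustrative assumptions). It models the two failure modes together: the stop signal is overwritten by a late `ACTIVE`, and the interrupt is swallowed by a lower layer.
{code}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical, simplified model of the race described above -- not Spark code.
object StartStopRace {
  sealed trait State
  case object INITIALIZING extends State
  case object ACTIVE extends State
  case object TERMINATED extends State

  private val state = new AtomicReference[State](INITIALIZING)

  // Stand-in for work that catches the interrupt and only logs it (the
  // HADOOP-14084 behavior): the interrupt signal is lost here.
  private def swallowingWork(): Unit =
    try Thread.sleep(10)
    catch { case _: InterruptedException => () }

  private val streamThread = new Thread(new Runnable {
    override def run(): Unit = {
      state.set(ACTIVE)                 // overwrites TERMINATED if stop() ran first
      while (state.get() == ACTIVE) {   // the only remaining exit check
        swallowingWork()
      }
    }
  })

  def stop(): Unit = {
    state.set(TERMINATED)               // lost if the thread has not set ACTIVE yet
    streamThread.interrupt()            // lost inside swallowingWork()
  }

  def main(args: Array[String]): Unit = {
    streamThread.start()
    stop()                              // "stopping a query quickly"
    streamThread.join(1000)
    // Hangs only when stop() wins the race against state.set(ACTIVE);
    // when it loses, the loop sees TERMINATED and exits normally.
    println(s"stream thread still alive: ${streamThread.isAlive}")
  }
}
{code}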

  was:
Saw the following exception in some test log:
{code}
17/02/14 21:20:10.987 stream execution thread for this_query [id = 09fd5d6d-bea3-4891-88c7-0d0f1909188d,
runId = a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: Thread[Thread-48,5,main]
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1249)
	at java.lang.Thread.join(Thread.java:1323)
	at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
	at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75)
	at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46)
	at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36)
	at org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59)
	at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
	at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
{code}

This is the cause of some test timeout failures on Jenkins.
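
The WARN line at the top of the trace shows the interrupt being caught inside the Hadoop Shell layer and only logged. The sketch below (hedged: hypothetical helper names, not Hadoop's actual code) contrasts that swallowing pattern with an interrupt-preserving variant that would let the stream thread's own `InterruptedException` check fire.
{code}
// Hypothetical illustration of the interrupt-handling patterns -- not Hadoop code.
object InterruptHandling {
  // Anti-pattern: the InterruptedException is logged and dropped, the calling
  // (stream) thread's interrupt flag is cleared, and the join is retried, so a
  // caller that waits for InterruptedException never sees the stop signal.
  def joinSwallowing(t: Thread): Unit = {
    while (t.isAlive) {
      try t.join()
      catch {
        case _: InterruptedException =>
          println(s"WARN Interrupted while joining on: $t")  // signal lost here
      }
    }
  }

  // Preferred pattern: restore the interrupt status (or rethrow) so callers up
  // the stack can still observe that they were asked to stop.
  def joinPreservingInterrupt(t: Thread): Unit = {
    try t.join()
    catch {
      case e: InterruptedException =>
        Thread.currentThread().interrupt()
        throw e
    }
  }
}
{code}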


> Fix the race condition when starting and stopping a query quickly
> -----------------------------------------------------------------
>
>                 Key: SPARK-19617
>                 URL: https://issues.apache.org/jira/browse/SPARK-19617
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>
> The streaming thread in StreamExecution uses the following checks to decide whether it should exit:
> - Catch an `InterruptedException`.
> - `StreamExecution.state` is `TERMINATED`.
> When starting and stopping a query quickly, both checks may fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084], which swallows the `InterruptedException`.
> - `StreamExecution.stop` is called before `state` becomes `ACTIVE`; then [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252] overwrites the state from `TERMINATED` back to `ACTIVE`.
> If both of the above happen, the query hangs forever.



