spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin
Date Mon, 03 Dec 2018 11:24:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707021#comment-16707021 ]

Hyukjin Kwon commented on SPARK-26175:
--------------------------------------

[~mengxr], BTW, can't this be done by {{pipe}}?

> PySpark cannot terminate worker process if user program reads from stdin
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26175
>                 URL: https://issues.apache.org/jira/browse/SPARK-26175
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Ala Luszczak
>            Priority: Major
>              Labels: Hydrogen
>
> The PySpark worker daemon reads the PIDs of the workers to kill from stdin: https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, the worker process is forked from the worker daemon process, and stdin is not closed in the child after the fork. This means the child, and any user program it runs, can read from stdin as well, which can block the daemon from receiving the PID to kill. This can cause issues because the task reaper may then detect that the task was not terminated and eventually kill the JVM.
> Possible fixes could be:
> * Closing stdin of the worker process right after the fork (see the first sketch following the quoted description below).
> * Creating a new socket to receive the PIDs to kill instead of using stdin (also sketched below).
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
>
> def task(it):
>     # `cat` blocks reading the inherited stdin, simulating user code that consumes the daemon's stdin
>     subprocess.check_output(["cat"])
>     return it
>
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that the {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2    Sl+    0:00  |   |   \_ python
> 19803 pts/2    Sl+    0:11  |   |       \_ /usr/lib/jvm/java-8-oracle/bin/java -cp /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2    S      0:00  |   |           \_ python -m pyspark.daemon
> 19895 pts/2    S      0:00  |   |               \_ python -m pyspark.daemon
> 19898 pts/2    S      0:00  |   |                   \_ cat
> {code}
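
To make the failure mode above concrete, the following is a minimal standalone sketch (not Spark code) of how a forked child that inherits fd 0 can consume input intended for its parent. The parent plays the role of the daemon waiting for a PID on stdin; the child plays the role of a worker whose user code spawns {{cat}}:

{code}
# Minimal standalone sketch (not Spark code): after os.fork() the child shares
# fd 0 with the parent, so input arriving on stdin can be consumed by either side.
import os
import subprocess
import sys

pid = os.fork()
if pid == 0:
    # Child ("worker"): user code spawns `cat`, which reads the shared stdin.
    subprocess.run(["cat"])      # blocks until EOF, eating the parent's input
    os._exit(0)
else:
    # Parent ("daemon"): expects to read a PID to kill from stdin, but the
    # line it is waiting for may be consumed by the child's `cat` instead.
    line = sys.stdin.readline()
    print("daemon received:", line.strip())
    os.waitpid(pid, 0)
{code}

Whichever process reads first wins the race, which is why the daemon can miss the kill request entirely.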
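
The first proposed fix, closing stdin in the worker right after fork, could look roughly like the sketch below. This is only an illustration of the idea, not the actual change to {{python/pyspark/daemon.py}}; redirecting fd 0 to /dev/null (rather than closing it outright) avoids fd 0 being silently reused by the next open():

{code}
# Sketch of fix option 1 (illustrative, not the actual daemon.py patch):
# detach the forked worker from the daemon's stdin immediately after fork.
import os
import sys

pid = os.fork()
if pid == 0:
    # Worker: point fd 0 at /dev/null so user code reading stdin sees EOF
    # instead of stealing the daemon's "PID to kill" messages.
    devnull = os.open(os.devnull, os.O_RDONLY)
    os.dup2(devnull, 0)
    os.close(devnull)
    sys.stdin = os.fdopen(0)     # keep sys.stdin in sync with the new fd 0
    # ... run worker / user code here ...
    os._exit(0)
else:
    # Daemon: remains the only reader of the original stdin.
    pass
{code}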
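
The second proposed fix, a dedicated socket for kill requests, could be sketched as below. The function name and the wire format (a 4-byte big-endian PID) are assumptions made for illustration; Spark's actual daemon/JVM protocol is not shown here:

{code}
# Sketch of fix option 2 (hypothetical protocol): the daemon accepts PIDs to
# kill on a local TCP socket, leaving stdin entirely to user code.
import os
import signal
import socket
import struct

def serve_kill_requests():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))           # let the OS choose a free port
    srv.listen(1)
    port = srv.getsockname()[1]          # the JVM side would be told this port
    conn, _ = srv.accept()
    while True:
        data = conn.recv(4)
        if len(data) < 4:                # connection closed (or short read)
            break
        (worker_pid,) = struct.unpack("!i", data)   # 4-byte big-endian PID
        os.kill(worker_pid, signal.SIGKILL)
{code}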





