spark-issues mailing list archives

From "Antony Mayi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-12617) socket descriptor leak killing streaming app
Date Mon, 04 Jan 2016 09:02:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antony Mayi updated SPARK-12617:
--------------------------------
    Attachment: bug.py

> socket descriptor leak killing streaming app
> --------------------------------------------
>
>                 Key: SPARK-12617
>                 URL: https://issues.apache.org/jira/browse/SPARK-12617
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Streaming
>    Affects Versions: 1.5.2
>         Environment: pyspark (python 2.6)
>            Reporter: Antony Mayi
>            Priority: Critical
>         Attachments: bug.py
>
>
> There is a socket descriptor leak in a pyspark streaming app when it is configured with
a batch interval slightly longer than 30 seconds. This is due to the default timeout of the
py4j JavaGateway, which (half-)closes the CallbackConnection after 30 seconds of inactivity
and creates a new one the next time it is needed. Those connections never get closed on the
python CallbackServer side and keep piling up until they eventually block new connections.
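The pile-up of half-closed connections can be illustrated outside of Spark with plain sockets. This is a minimal sketch, not the actual py4j code: the accepting side stands in for the python CallbackServer and deliberately never closes what it accepts, so every peer disconnect leaves a descriptor stuck in CLOSE_WAIT.

```python
import socket
import threading

# "Callback server": accepts connections but never closes them,
# mimicking the leak described above.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(5)
port = server.getsockname()[1]

leaked = []  # accepted sockets we (deliberately) never close

def accept_loop(n):
    for _ in range(n):
        conn, _addr = server.accept()
        leaked.append(conn)

t = threading.Thread(target=accept_loop, args=(3,))
t.start()

# "Gateway": opens a connection, then half-closes it after its timeout.
for _ in range(3):
    c = socket.create_connection(('127.0.0.1', port))
    c.close()  # client side closes; server side never does -> CLOSE_WAIT

t.join()
print(len(leaked))  # -> 3 descriptors lingering on the accepting side
server.close()
```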
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback connections, of which 0 will ever be closed:
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm this by running lsof against the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770  das   94u  IPv4 138824906      0t0       TCP localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770  das   95u  IPv4 138867747      0t0       TCP localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770  das   96u  IPv4 138831829      0t0       TCP localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770  das   97u  IPv4 138890524      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770  das   98u  IPv4 138860190      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770  das   99u  IPv4 138860439      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
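The same check can be scripted from inside the driver process. This is a hedged sketch, Linux-only and not part of the attached bug.py: it counts sockets in CLOSE_WAIT on a given local port by parsing `/proc/net/tcp` directly, as an alternative to shelling out to lsof (51282 is just the callback-server port from the log above).

```python
def count_close_wait(local_port):
    """Count IPv4 sockets in CLOSE_WAIT on local_port (Linux only).

    Note: IPv6 sockets would appear in /proc/net/tcp6 instead.
    """
    CLOSE_WAIT = '08'  # kernel TCP state code for CLOSE_WAIT
    count = 0
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_addr, state = fields[1], fields[3]
            # local_addr looks like '0100007F:C852' (hex IP : hex port)
            port = int(local_addr.split(':')[1], 16)
            if port == local_port and state == CLOSE_WAIT:
                count += 1
    return count

print(count_close_wait(51282))
```

Logging this counter periodically from the driver gives the same growth curve that bug.py reports.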
> * If you leave it running long enough, the CallbackServer eventually becomes unable
to accept new connections from the gateway and the app crashes:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 1451711400000 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at java.net.Socket.connect(Socket.java:538)
>         at java.net.Socket.<init>(Socket.java:434)
>         at java.net.Socket.<init>(Socket.java:244)
>         at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}
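Until the py4j timeout behaviour is addressed, one possible mitigation (an assumption based on the 30-second idle timeout described above, not a confirmed fix) is to keep the batch interval under 30 seconds, so the callback connection never sits idle long enough to be half-closed and replaced:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-app")  # app name is illustrative
# A batch interval below py4j's 30 s idle timeout keeps the callback
# connection busy, so the gateway should never half-close and re-open it.
ssc = StreamingContext(sc, batchDuration=25)
```

Since the leak only manifests with batch intervals slightly above 30 seconds, this sidesteps the trigger rather than fixing the underlying connection handling.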



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

