flink-issues mailing list archives

From "Flink Jira Bot (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
Date Mon, 24 May 2021 10:56:02 GMT

     [ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-16468:
-----------------------------------
      Labels: auto-deprioritized-major  (was: stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is
being deprioritized. If this ticket is actually Major, please raise the priority and ask a
committer to assign you the issue or revive the public discussion.


> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3, 1.9.2, 1.10.0
>         Environment: Ubuntu Linux servers at the latest Ubuntu patch level, running the current release of the Java 8 JRE
>            Reporter: Jason Kania
>            Priority: Minor
>              Labels: auto-deprioritized-major
>
> When a BlobClient retrieval fails, as in the following log, the rapid retries exhaust
> the available sockets. All of the retries happen within a few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004 Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>  at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>  at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>  at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>  at java.net.Socket.createImpl(Socket.java:478)
>  at java.net.Socket.connect(Socket.java:605)
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>  ... 8 more
> {noformat}
>  The retries should have some form of backoff in this situation to avoid flooding the
logs and exhausting other resources on the server.
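
The backoff requested above could look roughly like the following. This is only a
minimal sketch of capped exponential backoff with jitter, not Flink's code and not
the fix adopted for this issue. The names BackoffRetry, IoAction, delayMillis and
retry, as well as the delay parameters, are hypothetical and chosen for illustration.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of Flink): retries an I/O action with capped exponential backoff. */
final class BackoffRetry {

    /** An action that may fail with an IOException, e.g. one download attempt. */
    @FunctionalInterface
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Upper bound of the wait before retry number `attempt` (0-based): initial * 2^attempt, capped. */
    static long delayMillis(int attempt, long initialDelayMs, long maxDelayMs) {
        long delay = initialDelayMs;
        // Double step by step and stop at the cap, so the value cannot grow without bound.
        for (int i = 0; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /** Runs the action up to maxAttempts times, sleeping between failed attempts. */
    static <T> T retry(IoAction<T> action, int maxAttempts, long initialDelayMs, long maxDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1 || initialDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and delays must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e; // remember the failure and fall through to the wait below
            }
            if (attempt + 1 < maxAttempts) {
                long cap = delayMillis(attempt, initialDelayMs, maxDelayMs);
                // Jitter: sleep between cap/2 and cap so that many clients do not retry in lockstep.
                long half = cap / 2;
                Thread.sleep(half + ThreadLocalRandom.current().nextLong(cap - half + 1));
            }
        }
        throw last; // every attempt failed
    }
}
```

With an initial delay of 100 ms and a cap of 5000 ms, the waits grow as 100, 200,
400, 800 ms and so on up to the cap, instead of all retries firing within a few
milliseconds. A caller would wrap a single download attempt in an IoAction and
should also make sure that any socket opened by a failed attempt is closed before
the next one starts.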



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
