flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Flink Jira Bot (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets
Date Mon, 24 May 2021 10:56:02 GMT

     [ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Flink Jira Bot updated FLINK-16468:
      Labels: auto-deprioritized-major  (was: stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 ago and has not received any updates so it is being
deprioritized. If this ticket is actually Major, please raise the priority and ask a committer
to assign you the issue or revive the public discussion.

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.3, 1.9.2, 1.10.0
>         Environment: Linux ubuntu servers running, patch current latest Ubuntu patch
current release java 8 JRE
>            Reporter: Jason Kania
>            Priority: Minor
>              Labels: auto-deprioritized-major
> In situations where the BlobClient retrieval fails as in the following log, rapid retries
will exhaust the open sockets. All the retries happen within a few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch
BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7
from aaa-1/ and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>  at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>  at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>  at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>  at java.net.Socket.createImpl(Socket.java:478)
>  at java.net.Socket.connect(Socket.java:605)
>  at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>  ... 8 more
> {noformat}
>  The retries should have some form of backoff in this situation to avoid flooding the
logs and exhausting other resources on the server.

This message was sent by Atlassian Jira

View raw message