flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sayat Satybaldiyev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9831) Too many open files for RocksDB
Date Thu, 12 Jul 2018 13:15:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541610#comment-16541610
] 

Sayat Satybaldiyev commented on FLINK-9831:
-------------------------------------------

[~srichter] I think I found an issue. I did
{code:java}
lsof -p FLINK_TM_PID | wc -l{code}
and it give 8240 open file descriptors. Then I've looked in into the limits of the FLINK_TM_PID
in the /proc directory it gives the following output:

 

 
{code:java}
Limit          Soft Limit Hard Limit Units 
Max open files 8192       8192       files
{code}
 

My guess, that Flink TM has overridden _max_open_file_ during startup and set the custom
one.  This might be related to this rocksdb issue [https://github.com/facebook/rocksdb/issues/457.]

Question: how can I find which value Flinks uses by default with RocksDB? The second on, how
can(or should) I override this limit?

 

> Too many open files for RocksDB
> -------------------------------
>
>                 Key: FLINK-9831
>                 URL: https://issues.apache.org/jira/browse/FLINK-9831
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0
>            Reporter: Sayat Satybaldiyev
>            Priority: Major
>         Attachments: flink_open_files.txt
>
>
> While running only one Flink job, which is backed by RocksDB with checkpoining to HDFS
we encounter an exception that TM cannot access the SST file because the process has too
many open files. However, we have already increased the file soft/hard limit on the machine.
> Number open files for TM on the machine:
>  
> {code:java}
> lsof -p 23301|wc -l
> 8241{code}
>  
> Instance limits
>  
> {code:java}
> ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 256726
> max locked memory (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files (-n) 1048576
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) 128000
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>  
> {code}
>  
> [^flink_open_files.txt]
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:227)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:730)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:295)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend
for KeyedCoProcessOperator_98a16ed3228ec4a08acd8d78420516a1_(1/1) from any of the 1 provided
restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:276)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:132)
> 	... 5 more
> Caused by: java.io.FileNotFoundException: /tmp/flink-io-3da06c9e-f619-44c9-b95f-54ee9b1a084a/job_b3ecbdc0eb2dc2dfbf5532ec1fcef9da_op_KeyedCoProcessOperator_98a16ed3228ec4a08acd8d78420516a1__1_1__uuid_c4b82a7e-8a04-4704-9e0b-393c3243cef2/3701639a-bacd-4861-99d8-5f3d112e88d6/000016.sst
(Too many open files)
> 	at java.io.FileOutputStream.open0(Native Method)
> 	at java.io.FileOutputStream.open(FileOutputStream.java:270)
> 	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> 	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
> 	at org.apache.flink.core.fs.local.LocalDataOutputStream.<init>(LocalDataOutputStream.java:47)
> 	at org.apache.flink.core.fs.local.LocalFileSystem.create(LocalFileSystem.java:275)
> 	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.create(SafetyNetWrapperFileSystem.java:121)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.copyStateDataHandleData(RocksDBKeyedStateBackend.java:1008)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.transferAllDataFromStateHandles(RocksDBKeyedStateBackend.java:988)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.transferAllStateDataToDirectory(RocksDBKeyedStateBackend.java:973)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstance(RocksDBKeyedStateBackend.java:758)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:732)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:443)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:149)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message