flink-issues mailing list archives

From "Maximilian Michels (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2865) OutOfMemory error (Direct buffer memory)
Date Wed, 21 Oct 2015 09:52:27 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966567#comment-14966567 ]

Maximilian Michels commented on FLINK-2865:
-------------------------------------------

There is no limit anymore. As long as we don't take the network memory into account, we shouldn't
set a hard direct memory limit. Otherwise, we may run out of direct memory even when the
direct memory limit is set to the value of {{taskmanager.heap.mb}}. The previous attempt to calculate
the exact direct memory usage failed because the allocation logic of Netty's internal buffer pool
is hard to predict. Once we figure this out (I was talking to [~uce]), we can go back
to a hard direct memory limit like the one we have for heap memory.
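
The numbers in the quoted log below illustrate why a hard limit that ignores network memory is fragile. The following sketch only adds up figures copied from that log (the network buffer pool and the managed off-heap allocation) and compares the sum with the configured {{-XX:MaxDirectMemorySize}}. It assumes that both allocations count against the direct memory limit when off-heap memory is enabled. The function name {{direct_demand_mb}} is made up for illustration, and Netty's own buffers are not modelled.

```python
# Figures copied from the quoted TaskManager log; nothing here is measured.
MIB = 1024 * 1024

max_direct_mb = 53248        # -XX:MaxDirectMemorySize=53248M
segments = 393216            # taskmanager.network.numberOfBuffers
segment_bytes = 32768        # "bytes per segment" in the NetworkBufferPool line
managed_bytes = 48172092966  # managed off-heap allocation that failed


def direct_demand_mb(num_segments: int) -> int:
    """Network buffer pool plus managed off-heap memory, in whole MB.

    Assumption: both pools are allocated as direct memory.
    """
    network_mb = num_segments * segment_bytes // MIB
    managed_mb = managed_bytes // MIB
    return network_mb + managed_mb


# Compare the reported setup with the halved buffer count from the report.
for n in (segments, segments // 2):
    demand = direct_demand_mb(n)
    verdict = "exceeds" if demand > max_direct_mb else "fits within"
    print(f"{n} buffers: {demand} MB {verdict} the {max_direct_mb} MB limit")
```

With 393216 buffers the two allocations add up to 58228 MB, above the 53248 MB limit. With half as many they come to 52084 MB, which is consistent with the reporter's observation that halving the network buffers allows a successful start.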

> OutOfMemory error (Direct buffer memory)
> ----------------------------------------
>
>                 Key: FLINK-2865
>                 URL: https://issues.apache.org/jira/browse/FLINK-2865
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 0.10
>            Reporter: Greg Hogan
>            Assignee: Maximilian Michels
>             Fix For: 0.10
>
>
> I see the following TaskManager error when using off-heap memory and a relatively high number of network buffers. Setting {{taskmanager.memory.off-heap: false}} or halving the number of network buffers (6 GB instead of 12 GB) results in a successful start.
> {noformat}
> 18:17:25,912 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable
to load native-hadoop library for your platform... using builtin-java classes where applicable
> 18:17:26,024 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
> 18:17:26,024 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting
TaskManager (Version: 0.10-SNAPSHOT, Rev:d047ddb, Date:18.10.2015 @ 08:54:59 UTC)
> 18:17:26,025 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current
user: ec2-user
> 18:17:26,025 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM:
Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.60-b23
> 18:17:26,025 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum
heap size: 5104 MiBytes
> 18:17:26,025 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME:
/usr/java/latest
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop
version: 2.3.0
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM
Options:
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -Xms5325M
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -Xmx5325M
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -XX:MaxDirectMemorySize=53248M
> 18:17:26,026 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -Dlog.file=/home/ec2-user/flink/log/flink-ec2-user-taskmanager-0-ip-10-0-98-3.log
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -Dlog4j.configuration=file:/home/ec2-user/flink/conf/log4j.properties
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  -Dlogback.configurationFile=file:/home/ec2-user/flink/conf/logback.xml
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program
Arguments:
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  --configDir
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  /home/ec2-user/flink/conf
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  --streamingMode
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  
  batch
> 18:17:26,027 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
> 18:17:26,033 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum
number of open file descriptors is 1048576
> 18:17:26,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading
configuration from /home/ec2-user/flink/conf
> 18:17:26,079 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security
is not enabled. Starting non-authenticated TaskManager.
> 18:17:26,094 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying
to select the network interface and address to use by connecting to the leading JobManager.
> 18:17:26,094 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager
will try to connect for 10000 milliseconds before falling back to heuristics
> 18:17:26,097 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved
new target address /127.0.0.1:6123.
> 18:17:26,461 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager
will use hostname/address 'ip-10-0-98-3' (10.0.98.3) for communication.
> 18:17:26,462 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting
TaskManager in streaming mode BATCH_ONLY
> 18:17:26,462 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting
TaskManager actor system at 10.0.98.3:0
> 18:17:26,735 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger
started
> 18:17:26,767 INFO  Remoting                                                      - Starting
remoting
> 18:17:26,877 INFO  Remoting                                                      - Remoting
started; listening on addresses :[akka.tcp://flink@10.0.98.3:47484]
> 18:17:26,881 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting
TaskManager actor
> 18:17:26,925 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig
[server address: ip-10-0-98-3/10.0.98.3, server port: 45728, memory segment size (bytes):
32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client
threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client
connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
> 18:17:26,927 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages
between TaskManager and JobManager have a max timeout of 100000 milliseconds
> 18:17:26,931 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary
file directory '/volumes/xvdb/tmp': total 319 GB, usable 319 GB (100.00% usable)
> 18:17:26,931 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary
file directory '/volumes/xvdc/tmp': total 319 GB, usable 319 GB (100.00% usable)
> 18:17:32,194 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
12288 MB for network buffer pool (number of memory segments: 393216, bytes per segment: 32768).
> 18:17:32,195 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using
0.9 of the maximum memory size for Flink managed off-heap memory (45940 MB).
> 18:17:50,371 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Error
while starting up taskManager
> java.lang.Exception: OutOfMemory error (Direct buffer memory) while allocating the TaskManager
off-heap memory (48172092966 bytes). Try increasing the maximum direct memory (-XX:MaxDirectMemorySize)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1633)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1460)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1325)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1235)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
> 	at java.nio.Bits.reserveMemory(Bits.java:658)
> 	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> 	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> 	at org.apache.flink.runtime.memory.MemoryManager$HybridOffHeapMemoryPool.<init>(MemoryManager.java:661)
> 	at org.apache.flink.runtime.memory.MemoryManager.<init>(MemoryManager.java:166)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1618)
> 	... 4 more
> 18:17:50,374 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed
to run TaskManager.
> java.lang.Exception: OutOfMemory error (Direct buffer memory) while allocating the TaskManager
off-heap memory (48172092966 bytes). Try increasing the maximum direct memory (-XX:MaxDirectMemorySize)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1633)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1460)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1325)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1235)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
> 	at java.nio.Bits.reserveMemory(Bits.java:658)
> 	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> 	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> 	at org.apache.flink.runtime.memory.MemoryManager$HybridOffHeapMemoryPool.<init>(MemoryManager.java:661)
> 	at org.apache.flink.runtime.memory.MemoryManager.<init>(MemoryManager.java:166)
> 	at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1618)
> 	... 4 more
> {noformat}
> {noformat}
> ################################################################################
> #  Licensed to the Apache Software Foundation (ASF) under one
> #  or more contributor license agreements.  See the NOTICE file
> #  distributed with this work for additional information
> #  regarding copyright ownership.  The ASF licenses this file
> #  to you under the Apache License, Version 2.0 (the
> #  "License"); you may not use this file except in compliance
> #  with the License.  You may obtain a copy of the License at
> #
> #      http://www.apache.org/licenses/LICENSE-2.0
> #
> #  Unless required by applicable law or agreed to in writing, software
> #  distributed under the License is distributed on an "AS IS" BASIS,
> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> #  See the License for the specific language governing permissions and
> # limitations under the License.
> ################################################################################
> jobmanager.web.history: 50
> taskmanager.debug.memory.startLogThread: true
> taskmanager.debug.memory.logIntervalMs: 1000
> taskmanager.memory.fraction: 0.9
> taskmanager.memory.off-heap: true
> taskmanager.runtime.hashjoin-bloom-filters: true
> taskmanager.runtime.max-fan: 1024
> #==============================================================================
> # Common
> #==============================================================================
> # The host on which the JobManager runs. Only used in non-high-availability mode.
> # The JobManager process will use this hostname to bind the listening servers to.
> # The TaskManagers will try to connect to the JobManager on that host.
> jobmanager.rpc.address: localhost
> # The port where the JobManager's main actor system listens for messages.
> jobmanager.rpc.port: 6123
> # The heap size for the JobManager JVM
> jobmanager.heap.mb: 1024
> # The heap size for the TaskManager JVM
> taskmanager.heap.mb: 53248
> # The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
> taskmanager.numberOfTaskSlots: 32
> # The parallelism used for programs that did not specify any other parallelism.
> parallelism.default: 32
> #==============================================================================
> # Web Frontend
> #==============================================================================
> # The port under which the web-based runtime monitor listens.
> # A value of -1 deactivates the web server.
> jobmanager.web.port: 8081
> # The port under which the standalone web client
> # (for job upload and submit) listens.
> webclient.port: 8080
> # Temporary: Uncomment this to be able to use the new web frontend
> jobmanager.new-web-frontend: true
> #==============================================================================
> # Streaming state checkpointing
> #==============================================================================
> # The backend that will be used to store operator state checkpoints if 
> # checkpointing is enabled. 
> #
> # Supported backends: jobmanager, filesystem
> state.backend: jobmanager
> # Directory for storing checkpoints in a flink supported filesystem
> # Note: State backend must be accessible from the JobManager, use file://
> # only for local setups. 
> #
> # state.backend.fs.checkpointdir: hdfs://checkpoints
> #==============================================================================
> # Advanced
> #==============================================================================
> # The number of buffers for the network stack.
> taskmanager.network.numberOfBuffers: 393216
> # Directories for temporary files.
> #
> # Add a delimited list for multiple directories, using the system directory
> # delimiter (colon ':' on unix) or a comma, e.g.:
> #     /data1/tmp:/data2/tmp:/data3/tmp
> #
> # Note: Each directory entry is read from and written to by a different I/O
> # thread. You can include the same directory multiple times in order to create
> # multiple I/O threads against that directory. This is for example relevant for
> # high-throughput RAIDs.
> #
> # If not specified, the system-specific Java temporary directory (java.io.tmpdir
> # property) is taken.
> taskmanager.tmp.dirs: /volumes/xvdb/tmp:/volumes/xvdc/tmp
> # Path to the Hadoop configuration directory.
> #
> # This configuration is used when writing into HDFS. Unless specified otherwise,
> # HDFS file creation will use HDFS default settings with respect to block-size,
> # replication factor, etc.
> #
> # You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
> #
> # fs.hdfs.hadoopconf: /path/to/hadoop/conf/
> #==============================================================================
> # High Availability
> #==============================================================================
> # The list of ZooKeeper quorum peers that coordinate the high-availability
> # setup. This must be a list of the form
> # "host_1[:peerPort[:leaderPort]],host_2[:peerPort[:leaderPort]],..."
> #
> # recovery.mode: zookeeper
> #
> # ha.zookeeper.quorum: localhost
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
