flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edward <egb...@hotmail.com>
Subject Classloader leak with Kafka Mbeans / JMX Reporter
Date Tue, 06 Feb 2018 20:30:36 GMT
I'm having an issue where off-heap memory is growing unchecked until I get
OOM exceptions.

I was hoping that upgrading to 1.4 would solve these, since the child-first
classloader is supposed to resolve issues with Avro classes cached in a
different classloader (which prevents the classloaders from being garbage
collected).

However, after upgrading, we are still seeing an off-heap memory leak. I
think I may have isolated the issue to the JmxReporter class used for
collecting Kafka metrics.

Here are the details of what I'm seeing:
Our cluster is running in kubernetes, using the latest flink:1.4 docker
image. We are using the default classloading order (child first).

If I resubmit my job repeatedly, the ClassLoaders from the previous job
submissions don't get cleaned up, and the non-heap memory slowly grows until
the task manager runs out of memory. 

I can see all of the un-deleted classloaders if I run "sudo -u flink jmap
-clstats <proc_id>" (the output is below). This list of dead classloaders
continues to grow every time I kill and resubmit a new Flink job.  In all,
it lists 3200 dead class loaders. I'm only going to upload the ones which
show more than 2K of used memory.

finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness.liveness analysis may be inaccurate ...
class_loader	classes	bytes	parent_loader	alive?	type
0x00000000807302a0	7522	12213076	0x00000000804c58c0	dead
sun/misc/Launcher$AppClassLoader@0x000000010000f070
0x000000008eb00000	3699	6021535	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000094200190	3693	6016807	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000009e7bc6c8	3696	6001081	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000a9d80008	3584	5530412	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000f4103650	3581	5527354	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000901801f8	3581	5527354	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000942637c0	3231	5121176	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000096c2ec00	3231	5119662	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000008f600000	3225	5116241	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000092700d48	3228	5112270	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
<bootstrap>	2548	4424440	null	live	<internal>
0x0000000096b77190	2234	3634602	0x00000000807302a0	dead
org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98

Next I took a heap dump:
sudo -u flink jmap -dump:format=b,file=/tmp/HeapDump.hprof <procpid>

Then, using Eclipse Memory Analyzer, I followed the steps from this blog
post:
http://java.jiderhamn.se/2011/12/11/classloader-leaks-i-how-to-find-classloader-leaks-with-eclipse-memory-analyser-mat/

The result of looking for strong references to classes in a dead classloader
is this tree:

Class Name                                                                               
                 
| Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------------------------------
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader
@ 0x94200190|           88 |       616,992
'- <classloader> class
org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @ 0x94250cb0                 
|            0 |             0
   '- <class> org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @
0xbae537e8                           |           24 |           328
      '- object com.sun.jmx.mbeanserver.NamedObject @ 0xbace01e0                         
                 
|           24 |            24
         '- value java.util.HashMap$Node @ 0xbace0110                                    
                 
|           32 |           232
            '- [247] java.util.HashMap$Node[512] @ 0xfa0d08c0                            
                 
|        2,064 |       120,104
               '- table java.util.HashMap @ 0x806e9f08                                   
                 
|           48 |       120,152
                  '- value java.util.HashMap$Node @ 0x806e9ee8                           
                 
|           32 |       120,184
                     '- [8] java.util.HashMap$Node[16] @ 0x80502da0                      
                 
|           80 |       134,944
                        '- table java.util.HashMap @ 0x80502d70                          
                 
|           48 |       134,992
                           '- domainTb com.sun.jmx.mbeanserver.Repository @
0x80502d50                      |           32 |       135,200
-------------------------------------------------------------------------------------------------------------------------------------------

This suggests to me that JMX registry is getting created by bootstrap or app
classloader. Then we're registering MBeans with it for each running job.
Then when the job is called the classloader can't be cleaned up because
there is a reference to the object in the bootstrap classloader.

Is there anything that can be done about this in the job code? 





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Mime
View raw message