hadoop-yarn-dev mailing list archives

From "Thomas Friedrich (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-5309) SSLFactory truststore reloader thread leak in TimelineClientImpl
Date Sat, 02 Jul 2016 04:10:10 GMT
Thomas Friedrich created YARN-5309:

             Summary: SSLFactory truststore reloader thread leak in TimelineClientImpl
                 Key: YARN-5309
                 URL: https://issues.apache.org/jira/browse/YARN-5309
             Project: Hadoop YARN
          Issue Type: Bug
          Components: timelineserver, yarn
    Affects Versions: 2.7.1
            Reporter: Thomas Friedrich

We found an issue similar to HADOOP-11368 in TimelineClientImpl. The class creates an SSLFactory
instance in newSslConnConfigurator, and initializing that factory creates a ReloadingX509TrustManager
instance, which in turn starts a truststore reloader thread.
However, the SSLFactory is never destroyed, so the truststore reloader threads are never stopped.
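
To make the leak mechanism concrete, here is a self-contained Java sketch that does not need Hadoop on the classpath. `ReloadingTrustStore` is a hypothetical class modeled loosely on ReloadingX509TrustManager: init() starts a daemon thread named "Truststore reloader thread" that sleeps between truststore checks, and only destroy() stops it. The 10-second interval and the destroyAndAwait() helper are illustrative assumptions, not Hadoop's actual values or API.

```java
/**
 * Self-contained model of the leak: init() starts a reloader thread that only
 * destroy() can stop. Loosely modeled on ReloadingX509TrustManager; not Hadoop code.
 */
public class ReloaderLeakDemo {

    /** Hypothetical stand-in for ReloadingX509TrustManager. */
    static final class ReloadingTrustStore implements Runnable {
        private final long reloadIntervalMs;
        private volatile boolean running;
        private Thread reloader;

        ReloadingTrustStore(long reloadIntervalMs) {
            this.reloadIntervalMs = reloadIntervalMs;
        }

        /** Starts the background reloader thread. */
        void init() {
            running = true;
            reloader = new Thread(this, "Truststore reloader thread");
            reloader.setDaemon(true);
            reloader.start();
        }

        /** Stops the reloader; if nobody calls this, the thread runs until the JVM exits. */
        void destroy() {
            running = false;
            if (reloader != null) {
                reloader.interrupt();
            }
        }

        /** Illustrative helper: destroy() plus a bounded wait for the thread to exit. */
        boolean destroyAndAwait(long timeoutMs) {
            destroy();
            if (reloader == null) {
                return true;
            }
            try {
                reloader.join(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !reloader.isAlive();
        }

        boolean isReloaderAlive() {
            return reloader != null && reloader.isAlive();
        }

        @Override
        public void run() {
            while (running) {
                try {
                    // Produces the TIMED_WAITING (sleeping) state seen in a thread dump.
                    Thread.sleep(reloadIntervalMs);
                } catch (InterruptedException e) {
                    return; // destroy() interrupted the sleep
                }
                // A real reloader would check the truststore file for changes here.
            }
        }
    }

    public static void main(String[] args) {
        ReloadingTrustStore leaked = new ReloadingTrustStore(10_000);
        leaked.init();
        // The owner is done with it but never calls destroy(): the thread keeps running.
        System.out.println("leaked reloader alive: " + leaked.isReloaderAlive());

        ReloadingTrustStore cleaned = new ReloadingTrustStore(10_000);
        cleaned.init();
        System.out.println("cleaned reloader stopped: " + cleaned.destroyAndAwait(5_000));
    }
}
```

Because the thread is a daemon, it never blocks JVM shutdown, but in a long-lived server process nothing else ends it, so every instance that is not destroyed leaves one more sleeping thread behind.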

This problem was observed by a customer who had SSL enabled in Hadoop and submitted many queries
against HiveServer2 (HS2). After a few days, the HS2 instance crashed, and a Java thread dump
showed more than 13,000 threads like this one:
"Truststore reloader thread" #126 daemon prio=5 os_prio=0 tid=0x00007f680d2e3000 nid=0x98fd waiting on condition [0x00007f67e482c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run
        at java.lang.Thread.run(Thread.java:745)
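
To watch for this leak from inside a running JVM, the live threads can be counted by the name shown in the dump above. The sketch below relies only on the standard Thread.getAllStackTraces() method; the class and method names are hypothetical. From outside the process, the same information is available in a jstack thread dump.

```java
/** Counts live threads by name; a hypothetical diagnostic helper. */
public class ReloaderThreadCounter {

    // Thread name taken from the thread dump in this report.
    static final String RELOADER_THREAD_NAME = "Truststore reloader thread";

    /** Returns how many live threads in this JVM have exactly the given name. */
    static long countLiveThreadsNamed(String name) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && name.equals(t.getName()))
                .count();
    }

    public static void main(String[] args) {
        long count = countLiveThreadsNamed(RELOADER_THREAD_NAME);
        System.out.println(RELOADER_THREAD_NAME + " threads: " + count);
    }
}
```

A count that keeps rising in step with the number of submitted jobs points to the leak described in this issue.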

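HiveServer2 uses JobClient to submit each job. The following debugger stack, captured at the breakpoint
in the ReloadingX509TrustManager constructor, shows how the reloader thread gets created:
Thread [HiveServer2-Background-Pool: Thread-188] (Suspended (breakpoint at line 89 in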
	owns: Object  (id=464)	
	owns: Object  (id=465)	
	owns: Object  (id=466)	
	owns: ServiceLoader<S>  (id=210)	
	ReloadingX509TrustManager.<init>(String, String, String, long) line: 89	
	FileBasedKeyStoresFactory.init(SSLFactory$Mode) line: 209	
	SSLFactory.init() line: 131	
	TimelineClientImpl.newSslConnConfigurator(int, Configuration) line: 532	
	TimelineClientImpl.newConnConfigurator(Configuration) line: 507	
	TimelineClientImpl.serviceInit(Configuration) line: 269	
	TimelineClientImpl(AbstractService).init(Configuration) line: 163	
	YarnClientImpl.serviceInit(Configuration) line: 169	
	YarnClientImpl(AbstractService).init(Configuration) line: 163	
	ResourceMgrDelegate.serviceInit(Configuration) line: 102	
	ResourceMgrDelegate(AbstractService).init(Configuration) line: 163	
	ResourceMgrDelegate.<init>(YarnConfiguration) line: 96	
	YARNRunner.<init>(Configuration) line: 112	
	YarnClientProtocolProvider.create(Configuration) line: 34	
	Cluster.initialize(InetSocketAddress, Configuration) line: 95	
	Cluster.<init>(InetSocketAddress, Configuration) line: 82	
	Cluster.<init>(Configuration) line: 75	
	JobClient.init(JobConf) line: 475	
	JobClient.<init>(JobConf) line: 454	
	MapRedTask(ExecDriver).execute(DriverContext) line: 401	
	MapRedTask.execute(DriverContext) line: 137	
	MapRedTask(Task<T>).executeTask() line: 160	
	TaskRunner.runSequential() line: 88	
	Driver.launchTask(Task<Serializable>, String, boolean, String, int, DriverContext) line: 1653	
	Driver.execute() line: 1412	

A new JobClient, YarnClientImpl, and TimelineClientImpl instance is created for every job. Because
the HS2 process stays up for days, the truststore reloader threads from earlier jobs are never
stopped; they accumulate in the HS2 process and eventually exhaust its available resources.

A fix similar to HADOOP-11368 seems to be needed in TimelineClientImpl, but the class does not
have a destroy method to begin with.
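
The stack trace shows that TimelineClientImpl is an AbstractService, so one plausible place for a HADOOP-11368-style fix is the service's stop path: keep a reference to the SSLFactory and call its destroy() method there. The self-contained sketch below models that idea with hypothetical classes (`SslFactoryStub`, `TimelineClientSketch`, and `runJobs` are illustrative, not Hadoop code) and assumes the caller stops each client once the job is submitted. It is only a sketch of the idea, not an actual patch.

```java
import java.util.ArrayList;
import java.util.List;

public class TimelineClientFixSketch {

    static final String RELOADER_THREAD_NAME = "Truststore reloader thread";

    /** Hypothetical stand-in for SSLFactory: init() starts a reloader thread, destroy() stops it. */
    static final class SslFactoryStub {
        private Thread reloader;

        void init() {
            reloader = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(10_000); // a periodic truststore check would go here
                    }
                } catch (InterruptedException e) {
                    // destroy() interrupted the sleep: let the thread end.
                }
            }, RELOADER_THREAD_NAME);
            reloader.setDaemon(true);
            reloader.start();
        }

        void destroy() {
            if (reloader == null) {
                return;
            }
            reloader.interrupt();
            try {
                reloader.join(5_000); // joining only keeps this demo's counts deterministic
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Hypothetical client that keeps its factory so the stop path can destroy it. */
    static final class TimelineClientSketch implements AutoCloseable {
        private SslFactoryStub sslFactory;

        void serviceInit() {
            sslFactory = new SslFactoryStub();
            sslFactory.init();
        }

        void serviceStop() {
            if (sslFactory != null) {
                sslFactory.destroy(); // the step missing in the reported code path
                sslFactory = null;
            }
        }

        @Override
        public void close() {
            serviceStop();
        }
    }

    static long countReloaderThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && RELOADER_THREAD_NAME.equals(t.getName()))
                .count();
    }

    /** Simulates one client per job; returns how many reloader threads were left running. */
    static long runJobs(int jobs, boolean stopClients) {
        // References are kept only so the demo can clean up; the threads leak without them too.
        List<TimelineClientSketch> unstopped = new ArrayList<>();
        long before = countReloaderThreads();
        for (int i = 0; i < jobs; i++) {
            TimelineClientSketch client = new TimelineClientSketch();
            client.serviceInit();
            if (stopClients) {
                client.close();
            } else {
                unstopped.add(client);
            }
        }
        long growth = countReloaderThreads() - before;
        unstopped.forEach(TimelineClientSketch::close); // restore the baseline for later runs
        return growth;
    }

    public static void main(String[] args) {
        System.out.println("without stop: +" + runJobs(50, false) + " threads");
        System.out.println("with stop:    +" + runJobs(50, true) + " threads");
    }
}
```

Running main() should report 50 extra reloader threads when clients are not stopped and none when they are, which mirrors the per-job accumulation seen in HS2.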

One workaround is to disable the YARN timeline service (yarn.timeline-service.enabled=false).

This message was sent by Atlassian JIRA
