tez-dev mailing list archives

From "Hitesh Shah (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TEZ-3435) WebUIService thread tries to use blacklisted disk, dies, and kills AM
Date Fri, 16 Sep 2016 14:33:20 GMT

     [ https://issues.apache.org/jira/browse/TEZ-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah resolved TEZ-3435.
------------------------------
    Resolution: Not A Bug

Resolving this based on the discussion. Thanks for reporting the issue [~mprim]. 



> WebUIService thread tries to use blacklisted disk, dies, and kills AM
> ---------------------------------------------------------------------
>
>                 Key: TEZ-3435
>                 URL: https://issues.apache.org/jira/browse/TEZ-3435
>             Project: Apache Tez
>          Issue Type: Bug
>          Components: UI
>    Affects Versions: 0.8.4
>            Reporter: Michael Prim
>            Priority: Critical
>
> We recently hit an issue where certain Tez jobs died when scheduled on a node that had a broken disk. The disk was already marked as broken and excluded by the YARN node manager. Other applications worked fine on that node; only Tez jobs died.
> The errors were ClassNotFoundExceptions for basic Hadoop classes that should be available everywhere. After some investigation we found out that the WebUIService thread, spawned by the AM, tries to utilize that broken disk. See the stack trace below; disk3 had been excluded by the node manager.
> {code}
>  [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
> java.util.zip.ZipException: error in opening zip file
> 	at java.util.zip.ZipFile.open(Native Method)
> 	at java.util.zip.ZipFile.<init>(ZipFile.java:219)
> 	at java.util.zip.ZipFile.<init>(ZipFile.java:149)
> 	at java.util.jar.JarFile.<init>(JarFile.java:166)
> 	at java.util.jar.JarFile.<init>(JarFile.java:130)
> 	at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
> 	at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
> 	at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
> 	at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
> 	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> 	at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
> 	at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
> 	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> 	at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> 	at org.mortbay.jetty.Server.doStart(Server.java:224)
> 	at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> 	at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
> 	at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
> 	at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
> 	at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
> {code}
> This led to the ClassNotFoundExceptions and killed the AM. Interestingly enough, the DAGAppMaster was aware of this broken disk and did exclude it from its localDirs, which contain only the remaining disks of the node.
> {code}
> [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for applicationId=application_1472223062609_42648, attemptNum=1, AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538, userFromEnv=muhammad, cliSessionOption=true, pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001, localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648, logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
> {code}
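>
> For reference, here is a minimal sketch (not Tez code) of how a container-side service can discover the healthy local directories. The YARN node manager passes them to every container at launch via the LOCAL_DIRS environment variable, so a list derived this way reflects the disk state at launch time, not disks that fail afterwards. Class and method names below are hypothetical.
> {code}
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.List;
>
> public class HealthyLocalDirs {
>     // LOCAL_DIRS is set by the NodeManager at container launch and lists
>     // the local directories it considered healthy at that moment,
>     // comma-separated. (Hypothetical helper, not part of Tez.)
>     public static List<String> fromEnv() {
>         String dirs = System.getenv("LOCAL_DIRS");
>         if (dirs == null || dirs.isEmpty()) {
>             return Collections.emptyList();
>         }
>         return Arrays.asList(dirs.split(","));
>     }
> }
> {code}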
> This is actually quite an issue: in a large data center there are always some broken disks, and by chance your AM may be scheduled on one of those nodes.
> Summary: From my point of view, it looks as if the WebUIService thread does not properly take into account the local directories that have been excluded by the node manager.
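>
> One conceivable mitigation (a hypothetical sketch, not what Tez or Jetty actually do) would be to probe each classpath jar before the embedded web server scans it, and skip any jar that can no longer be opened, e.g. because its disk failed after localization, instead of letting the ZipException above propagate and take down the AM.
> {code}
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.zip.ZipFile;
>
> public class JarProbe {
>     // Keeps only the jars that can actually be opened as zip archives.
>     // A jar sitting on a failed disk throws the same "error in opening
>     // zip file" seen in the stack trace above and is skipped instead of
>     // becoming fatal. (Class and method names are hypothetical.)
>     public static List<File> openableJars(List<File> candidates) {
>         List<File> usable = new ArrayList<>();
>         for (File jar : candidates) {
>             try (ZipFile probe = new ZipFile(jar)) {
>                 usable.add(jar);
>             } catch (IOException e) {
>                 System.err.println("Skipping unreadable jar: " + jar);
>             }
>         }
>         return usable;
>     }
> }
> {code}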



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
