hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6502) DistributedFileSystem#listStatus is very slow when listing a directory with a size of 1300
Date Fri, 22 Jan 2010 20:52:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803862#action_12803862 ]

Steve Loughran commented on HADOOP-6502:

_Disclaimer_ I have long advocated having an ASF exam in classloaders; nobody who hasn't passed
the exam would be allowed to mess with classloaders in any Apache project. As there is no
such exam, there is no proof that I can be considered competent enough to do this, and you
should treat everything I say with caution. Test my statements, preferably in JUnit methods.

1. Adding new classes is generally rare unless you are running something that generates
Java classes on the fly; JSP compilers do this. Even then, they try not to mess around with
things higher up the hierarchy (the exception is the JBoss default classloader, the broken
one that everyone hates).

2. Modern, OSGi-style classloaders are fairly strict; I don't think they add stuff higher
up. The more general concerns when you play with classloader trees are
# it's easy to leak classloaders. Retain one reference to a class loaded by a child classloader
and the classloader never gets GC'd, doesn't pick up updated JARs, consumes memory, and stops
your build overwriting any locked JARs (Windows only)
# all the rules about singletons and equality go out the window.

3. I would go for caching the failure. For those people playing games with classloaders, tough.
But do note that if the JSP engine does need to
compile a JSP class, then Hadoop is adding classes to some classpath
in the Hadoop JVM. So your tools may be doing what you don't think is
happening, even on a "normal" Hadoop instance.

4. Looking at the code in more detail, the thing's a bit of an ugly hack, a contrived workaround
to avoid a cycle. If there was an elegant solution to this that didn't involve reflection,
things would be much better. Nothing obvious springs to mind.
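
To make point 3 concrete, here is a minimal sketch of what "caching the failure" could look
like. It is not the actual Hadoop patch: the class name CachingClassLookup, its methods, and
the missing class name org.example.DoesNotExist are all made up for illustration. The idea is
only that a failed lookup is remembered alongside successful ones, so the classloader is
asked once per name rather than once per deserialized element.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: caches both successful and failed class lookups. */
public class CachingClassLookup {
    // Optional.empty() records a name that has already failed to load.
    private final Map<String, Optional<Class<?>>> cache = new ConcurrentHashMap<>();
    // Counts how often the classloader was actually consulted (for the demo only).
    private final AtomicInteger loaderCalls = new AtomicInteger();
    private final ClassLoader loader;

    public CachingClassLookup(ClassLoader loader) {
        this.loader = loader;
    }

    /** Returns the class if it can be loaded, or an empty Optional if it cannot. */
    public Optional<Class<?>> find(String name) {
        return cache.computeIfAbsent(name, this::load);
    }

    private Optional<Class<?>> load(String name) {
        loaderCalls.incrementAndGet();
        try {
            // initialize=false: we only need to know whether the class exists.
            Class<?> found = Class.forName(name, false, loader);
            return Optional.<Class<?>>of(found);
        } catch (ClassNotFoundException e) {
            // Remember the miss so the expensive lookup is not repeated.
            return Optional.empty();
        }
    }

    public int loaderCalls() {
        return loaderCalls.get();
    }

    public static void main(String[] args) {
        CachingClassLookup lookup =
            new CachingClassLookup(CachingClassLookup.class.getClassLoader());
        // Simulate deserializing 1300 entries, each probing for a missing class.
        for (int i = 0; i < 1300; i++) {
            lookup.find("org.example.DoesNotExist");
        }
        System.out.println("classloader consulted " + lookup.loaderCalls() + " time(s)");
    }
}
```

The trade-off is the one already mentioned in point 3: a class that becomes loadable after
its first failed lookup stays invisible to this cache, which is exactly the "tough" case for
people who add classes to the classpath at runtime.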

> DistributedFileSystem#listStatus is very slow when listing a directory with a size of
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-6502
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6502
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.20.0
>            Reporter: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.20.2, 0.21.0, 0.22.0
> When listing a directory of around 1300 children, it takes hundreds of milliseconds.
> It turns out the slowdown is caused by the change made by HADOOP-4187. The return value
> of listStatus is an array of FileStatus. When deserializing each element of the array,
> ReflectionUtils#newInstance(Class<T>, Configuration) is called, which calls setConf, which
> in turn calls setJobConf. setJobConf checks whether JobConf is on the class path by calling
> Configuration#getClassByName. Configuration#getClassByName tries to optimize the lookup
> using a cached map, but since JobConf is not on the class path, it is never in the cache.
> Every check therefore ends up calling Class.forName, which is very expensive.
> Deserializing an array of 1300 entries requires calling Class#forName 1300 times!

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
