hadoop-common-issues mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6502) DistributedFileSystem#listStatus is very slow when listing a directory with a size of 1300
Date Sat, 23 Jan 2010 02:35:23 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804000#action_12804000 ]

Arun C Murthy commented on HADOOP-6502:

I had a discussion with Hairong about this one... 

The important thing to note is that listStatus is slow as described in this jira *iff* we are writing
a stand-alone DFSClient without the map-reduce jars on the classpath of the application. That is
because if JobConf/JobConfigurable are on the classpath, the look-up is done only once and the result
is cached by Configuration.getClassByName. I believe Hairong was running a stand-alone test
where she did not have the MR jars on the classpath.
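
To illustrate the point, here is a minimal sketch of a cache that records only successful
look-ups. It is not the actual Configuration code; the class HitOnlyCacheDemo, its methods,
and the class name org.example.NotOnClasspath are hypothetical and exist only for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HitOnlyCacheDemo {
    // Holds successfully resolved classes only; failed look-ups are never recorded.
    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();
    // Counts how often we fall through to Class.forName (single-threaded sketch).
    private int forNameCalls = 0;

    Class<?> getClassByName(String name) throws ClassNotFoundException {
        Class<?> cached = cache.get(name);
        if (cached != null) {
            return cached;
        }
        forNameCalls++;
        Class<?> resolved = Class.forName(name); // throws if the class is absent
        cache.put(name, resolved);
        return resolved;
    }

    // Repeats the same look-up and reports how many Class.forName calls it cost.
    static int countLookups(String className, int repetitions) {
        HitOnlyCacheDemo demo = new HitOnlyCacheDemo();
        for (int i = 0; i < repetitions; i++) {
            try {
                demo.getClassByName(className);
            } catch (ClassNotFoundException e) {
                // Class is absent: nothing is cached, so the next call pays again.
            }
        }
        return demo.forNameCalls;
    }

    public static void main(String[] args) {
        // A class that is present is resolved once and then served from the cache.
        System.out.println(countLookups("java.lang.String", 1300));
        // A class that is absent (hypothetical name) is looked up on every call.
        System.out.println(countLookups("org.example.NotOnClasspath", 1300));
    }
}
```

With the class present, the 1300 requests cost a single Class.forName call; with it absent,
every request pays for a fresh call.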

Unfortunately, we cannot remove this feature from ReflectionUtils without breaking applications.
I'd propose we fix it by removing the offending code at the same time we remove the deprecated
org.apache.hadoop.mapred package. Until then it affects a small minority of applications...
even bin/hadoop commands are 'ok' since MR jars are on the classpath. 

However bin/hdfs might have this problem... sigh.

> DistributedFileSystem#listStatus is very slow when listing a directory with a size of 1300
> ------------------------------------------------------------------------------------------
>                 Key: HADOOP-6502
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6502
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.20.0
>            Reporter: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.20.2, 0.21.0, 0.22.0
> When listing a directory of around 1300 children, listStatus takes hundreds of milliseconds.
> It turns out the slowdown is caused by the change made in HADOOP-4187. The return value
> of listStatus is an array of FileStatus. When deserializing each element of the array,
> ReflectionUtils#newInstance(Class<T>, Configuration) is called, which calls setConf, which
> in turn calls setJobConf. setJobConf checks whether JobConf is on the classpath by calling
> Configuration#getClassByName. Configuration#getClassByName tries to optimize the lookup with
> a cached map, but because JobConf is not on the classpath, it never enters the cache. Every
> check therefore ends up calling Class#forName, which is very expensive. Deserializing an
> array of 1300 entries requires calling Class#forName 1300 times!
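
One possible way to avoid the repeated calls is to remember failed look-ups as well as
successful ones. The sketch below shows that idea only; it is not the change made for this
issue and not Hadoop code. The class NegativeCacheSketch, its methods, and the class name
org.example.NotOnClasspath are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class NegativeCacheSketch {
    // Maps a class name to its resolved class, or to Optional.empty() for a known miss.
    private final Map<String, Optional<Class<?>>> cache = new ConcurrentHashMap<>();
    // Counts how often we fall through to Class.forName (single-threaded sketch).
    private int forNameCalls = 0;

    Optional<Class<?>> findClass(String name) {
        Optional<Class<?>> cached = cache.get(name);
        if (cached != null) {
            return cached; // a hit or a remembered miss; no reflection needed
        }
        forNameCalls++;
        Optional<Class<?>> result;
        try {
            Class<?> resolved = Class.forName(name);
            result = Optional.<Class<?>>of(resolved);
        } catch (ClassNotFoundException e) {
            result = Optional.empty(); // remember that the class is absent
        }
        cache.put(name, result);
        return result;
    }

    // Repeats the same look-up and reports how many Class.forName calls it cost.
    static int countLookups(String className, int repetitions) {
        NegativeCacheSketch sketch = new NegativeCacheSketch();
        for (int i = 0; i < repetitions; i++) {
            sketch.findClass(className);
        }
        return sketch.forNameCalls;
    }

    public static void main(String[] args) {
        // The absent class (hypothetical name) now costs one Class.forName call, not 1300.
        System.out.println(countLookups("org.example.NotOnClasspath", 1300));
    }
}
```

The trade-off is that a remembered miss stays a miss: a class that becomes loadable later
would not be found until the cache entry is cleared.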

