hadoop-common-issues mailing list archives

From "Rick Weber (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-6901) Parsing large compressed files with HADOOP-1722 spawns multiple mappers per file
Date Thu, 05 Aug 2010 16:23:21 GMT
Parsing large compressed files with HADOOP-1722 spawns multiple mappers per file

                 Key: HADOOP-6901
                 URL: https://issues.apache.org/jira/browse/HADOOP-6901
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.21.0
         Environment: Hadoop v0.20.2 + HADOOP-1722
            Reporter: Rick Weber

This was originally discovered while using Dumbo to parse a very large (2 GB) compressed file. By default, Dumbo attempts to use AutoInputFormat as the input format.

Here is my use case:

I have a large (2 GB) compressed file.  It's compressed with the default codec, which is gzip based and unsplittable.  Despite that, the default implementation of AutoInputFormat reports the file as splittable.  As a result, the file is split into about 35 parts, and each part is assigned to a map task.

However, since the file itself is unsplittable, each map task winds up parsing the file again from the beginning.  This effectively multiplies the input by 35x, and the job takes about 35x as long to run.
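The number of splits reported above is consistent with the default split size. A rough sketch of the arithmetic, assuming the 64 MB default block size of Hadoop 0.20 (the exact configured value is an assumption):

```python
# Rough split-count arithmetic; 64 MB default split size is an assumption.
file_size_mb = 2 * 1024   # ~2 GB input file
split_size_mb = 64        # default dfs.block.size in 0.20
splits = -(-file_size_mb // split_size_mb)  # ceiling division
print(splits)  # → 32, in line with the ~35 splits observed
```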

If I set "-inputformat text", the problem does not appear in Dumbo.  If I manually invoke the streaming jar with AutoInputFormat, the problem appears.
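For reference, the workaround amounts to forcing TextInputFormat when submitting the streaming job, so that the gzip file gets a single split. This is an illustrative sketch only; the jar name, paths, and mapper/reducer commands below are assumptions, not taken from the actual job:

```shell
# Workaround sketch: force TextInputFormat so the compressed file is not
# split (paths and jar name are illustrative assumptions).
hadoop jar hadoop-streaming.jar \
  -input /data/big.gz \
  -output /data/out \
  -mapper cat \
  -reducer cat \
  -inputformat TextInputFormat
```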

Looking at the code in AutoInputFormat, it appears to inherit the default isSplitable() method from FileInputFormat, which reports everything as splittable.  I think this class should define its own isSplitable() method, similar to the one in TextInputFormat, which treats a file as splittable only if it is not compressed.
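The proposed check can be sketched as follows. This is a hypothetical, standalone illustration of the logic TextInputFormat applies; an actual patch would override isSplitable(FileSystem, Path) inside AutoInputFormat and consult CompressionCodecFactory rather than the hard-coded suffix list used here:

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of TextInputFormat-style splittability: a file is
// splittable only if no compression codec matches it. In Hadoop itself
// this would call CompressionCodecFactory.getCodec(path) instead of
// matching suffixes (the suffix list below is an assumption).
public class SplitCheck {

    // Suffixes of common unsplittable codecs (illustrative only).
    private static final List<String> UNSPLITTABLE_SUFFIXES =
        Arrays.asList(".gz", ".deflate");

    static boolean isSplitable(String path) {
        for (String suffix : UNSPLITTABLE_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return false; // compressed with an unsplittable codec
            }
        }
        return true; // no codec matched, safe to split
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.gz")); // false
        System.out.println(isSplitable("part-00000"));    // true
    }
}
```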

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
