hadoop-mapreduce-dev mailing list archives

From "Niels Basjes (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-5925) NLineInputFormat silently produces garbage on gzipped input
Date Fri, 13 Jun 2014 09:45:03 GMT
Niels Basjes created MAPREDUCE-5925:

             Summary: NLineInputFormat silently produces garbage on gzipped input
                 Key: MAPREDUCE-5925
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5925
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Niels Basjes
            Priority: Critical

[ Found while investigating the impact of MAPREDUCE-2094 ]

The org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (and probably the old mapred API
version too) only makes sense for splittable files.

This input format inherits isSplitable from its superclass FileInputFormat (whose default
implementation always returns true) and uses it in combination with the LineRecordReader.

When you provide it with a gzipped file (non-splittable compression), it creates multiple splits
(isSplitable == true), yet the LineRecordReader cannot read a gzipped file split by split
because the GzipCodec does not support splitting.
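
To illustrate why a gzip stream cannot be picked up at an arbitrary split boundary, here is a minimal sketch that uses only the JDK's java.util.zip classes, not Hadoop. The class name GzipSplitDemo and its helper methods are made up for this illustration. It compresses some lines and then tries to decode the result once from the start and once from the middle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it is not part of Hadoop.
public class GzipSplitDemo {

    /** Builds 'count' short text lines to use as sample input. */
    static String sampleLines(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString();
    }

    /** Compresses the text into a single gzip member. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns true only if the bytes from 'offset' onward decode as valid gzip. */
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // Drain the stream so header, payload and trailer are all checked.
            }
            return true;
        } catch (IOException e) {
            // Bad header, corrupt deflate data or a truncated stream all end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleLines(1000));
        System.out.println("decodable from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decodable from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Decoding from offset 0 should succeed, while decoding from the midpoint should fail because the bytes there do not start with a gzip header. Note that this sketch fails loudly, whereas the problem reported here is that the raw compressed bytes end up being treated as lines without any error.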

The overall effect is that the job silently produces incorrect results instead of failing.

Proposed solution: add detection for this kind of scenario and let the NLineInputFormat fail
hard when someone tries this.
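
A rough sketch of what such a fail-hard check could look like, written against plain Java rather than the Hadoop API so that it stays self-contained. The class NonSplittableGuard, its methods and the suffix list are all assumptions made for illustration. A real patch would more likely look up the codec through CompressionCodecFactory and test whether it implements SplittableCompressionCodec, as TextInputFormat.isSplitable does.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; it is not part of Hadoop.
public class NonSplittableGuard {

    // Assumed list of suffixes for non-splittable compression formats.
    // Hadoop itself resolves this through its codec classes instead.
    private static final Set<String> NON_SPLITTABLE_SUFFIXES =
            new HashSet<>(Arrays.asList(".gz", ".deflate"));

    /** Returns true if the file name ends with a known non-splittable suffix. */
    static boolean isNonSplittable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String suffix : NON_SPLITTABLE_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Fails hard instead of letting a job split a non-splittable file. */
    static void checkSplittable(String fileName) throws IOException {
        if (isNonSplittable(fileName)) {
            throw new IOException("Cannot split " + fileName
                    + " into N-line splits: its compression format is not splittable");
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"input.txt", "input.txt.gz"}) {
            try {
                checkSplittable(name);
                System.out.println(name + ": ok");
            } catch (IOException e) {
                System.out.println(name + ": rejected (" + e.getMessage() + ")");
            }
        }
    }
}
```

Calling such a check while the splits are computed would make the job fail before any mapper runs, which is the behaviour asked for above.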

I'm not sure if this should go into the LineRecordReader or only in the NLineInputFormat.
