hadoop-common-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-6835) Support concatenated gzip and bzip2 files
Date Tue, 29 Jun 2010 03:26:54 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated HADOOP-6835:
---------------------------------

    Attachment: HADOOP-6835.v4.trunk-hadoop-common.patch
                HADOOP-6835.v4.trunk-hadoop-mapreduce.patch

Final(?) patch against trunk.  All the good stuff is in the -common half; the mapreduce half
has a new unit test and six supporting test files.

Because the unit test purports to be a generic concatenated-compressed-input test, I ported
both of the main gzip subtests to work with the bzip2 decoder.  That turned up a possible
issue in the bzip2 decoder (or, equally likely, in the way I wrote the test case); for now,
I commented out that part of the test (near line 586).  The weird thing is that my own, more
rambunctious test case works fine.  This shouldn't be a blocker for a gzip-oriented patch,
but if a reviewer could take a quick look at the commented-out block and see if I did something
obviously wrong, I'd appreciate it.  (The problem didn't turn up in the Yahoo branch because
0.20.x doesn't support bzip2 concatenation, as the issue title notes.  The breakage appears
to occur precisely at the start of the second bzip2 "member.")
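
For reviewers who want to see where the second bzip2 "member" begins in a concatenated file, here is a minimal sketch. It uses Python's standard bz2 module, not the Hadoop bzip2 decoder or the unit test in the patch; the function name split_bzip2_streams and the sample payloads are made up for illustration.

```python
import bz2


def split_bzip2_streams(data):
    """Return a list of (byte offset, decoded payload), one per bzip2 stream.

    Hypothetical helper: each BZ2Decompressor decodes exactly one stream and
    exposes whatever follows it as `unused_data`, which marks the boundary
    where the next stream starts.
    """
    members = []
    offset = 0
    while offset < len(data):
        decoder = bz2.BZ2Decompressor()
        payload = decoder.decompress(data[offset:])
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream at offset %d" % offset)
        members.append((offset, payload))
        # unused_data is the tail left over after this stream ended.
        offset = len(data) - len(decoder.unused_data)
    return members


if __name__ == "__main__":
    first = bz2.compress(b"first member\n")
    second = bz2.compress(b"second member\n")
    for start, payload in split_bzip2_streams(first + second):
        print(start, payload)
```

A decoder that stops after the first stream, or that mishandles the bytes at the reported second offset, would show the kind of failure described above.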

> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: HADOOP-6835
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6835
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tom White
>            Assignee: Greg Roelofs
>         Attachments: grr-hadoop-common.dif.20100614c, grr-hadoop-mapreduce.dif.20100614c,
>                      HADOOP-6835.v3.yahoo-0.20.2xx-branch.patch, HADOOP-6835.v4.trunk-hadoop-common.patch,
>                      HADOOP-6835.v4.trunk-hadoop-mapreduce.patch, HADOOP-6835.v4.yahoo-0.20.2xx-branch.patch,
>                      MR-469.v2.yahoo-0.20.2xx-branch.patch
>
>
> When running MapReduce with concatenated gzip files as input, only the first part is read,
> which is confusing, to say the least. Concatenated gzip is described in
> http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage and in
> http://www.ietf.org/rfc/rfc1952.txt. (See the original report at
> http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)
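
The symptom in the description can be reproduced outside Hadoop. The sketch below uses Python's standard zlib and gzip modules, not Hadoop's codec classes: a decoder that handles a single gzip member stops at the first trailer and leaves the rest of the file unread, while looping over members recovers everything. The function names and sample data are made up for illustration.

```python
import gzip
import zlib

# wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
GZIP_WBITS = 31


def read_first_member_only(data):
    """Decode one gzip member; return (payload, bytes left unread)."""
    decoder = zlib.decompressobj(wbits=GZIP_WBITS)
    payload = decoder.decompress(data)
    return payload, decoder.unused_data


def read_all_members(data):
    """Decode every gzip member in turn and concatenate the payloads."""
    chunks = []
    while data:
        decoder = zlib.decompressobj(wbits=GZIP_WBITS)
        chunks.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        # Whatever follows the member's trailer is the next member.
        data = decoder.unused_data
    return b"".join(chunks)


if __name__ == "__main__":
    # Equivalent to: cat part1.gz part2.gz > both.gz
    both = gzip.compress(b"first\n") + gzip.compress(b"second\n")
    payload, leftover = read_first_member_only(both)
    print(payload, len(leftover))
    print(read_all_members(both))
```

RFC 1952 defines a gzip file as a series of members, so the looping behavior is the one that matches the format.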

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

