hadoop-common-dev mailing list archives

From Tim Broberg <Tim.Brob...@exar.com>
Subject RE: Making Gzip splittable
Date Wed, 22 Feb 2012 18:14:44 GMT

There are three options here:
 1 - Add your codec as an alternative to the default gzip codec.
 2 - Modify the gzip codec to incorporate your feature so that it is pseudo-splittable by
default (skippable?)
 3 - Do nothing

The code uses the normal splittability interface and doesn't invent some new solution. It
seems perfectly well explained.

There is a lot of explanation in there on how to switch over from one codec to the other.
Does it all get simpler if skippability is implemented by default but the option is not enabled?

Does this make things any less potentially confusing?

    - Tim.

From: niels@basj.es [niels@basj.es] On Behalf Of Niels Basjes [Niels@basjes.nl]
Sent: Sunday, February 19, 2012 4:23 PM
To: common-dev
Subject: Making Gzip splittable


As some of you know, I've created a patch that effectively makes Gzip
splittable.


For a split that starts somewhere in the middle of the file, the codec
reads from the start of the file up to the point where the split starts.
This is a useful waste of resources because it creates room to run a
heavy-lifting mapper in parallel.
Because of this balance between the waste being useful and the waste being
wasteful, I've included extensive documentation in the patch on how it works
and how to use it.
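
The read-and-discard idea described above can be sketched as follows. This is
only an illustration in Python, not the code from the patch: the function name
`read_split` and the `chunk` parameter are made up for this example, it assumes
a single-member gzip file held in memory, and it ignores record boundaries,
which a real Hadoop codec has to respect.

```python
import gzip
import zlib


def read_split(compressed: bytes, start: int, end: int, chunk: int = 64) -> bytes:
    """Return the decompressed bytes attributed to compressed range [start, end).

    Hypothetical helper: every split decompresses from byte 0 of the file,
    throws away what is produced before `start`, and keeps what is produced
    while feeding the compressed bytes in [start, end).
    """
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    kept = bytearray()
    pos = 0
    stop = min(end, len(compressed))
    while pos < stop:
        # Never let one chunk straddle the split's start boundary.
        limit = start if pos < start else stop
        step = min(chunk, limit - pos, len(compressed) - pos)
        out = decomp.decompress(compressed[pos:pos + step])
        if pos >= start:
            kept += out  # inside the split: hand this to the mapper
        # else: before the split, the work is done and the output is discarded
        pos += step
    return bytes(kept)


# Three "mappers", each handling one third of the compressed file.
data = b"".join(b"record %d\n" % i for i in range(5000))
comp = gzip.compress(data)
n = len(comp)
bounds = [0, n // 3, 2 * n // 3, n]
parts = [read_split(comp, bounds[i], bounds[i + 1]) for i in range(3)]
```

Each later split repeats the decompression work of the splits before it, which
is the waste mentioned above. It pays off only when the per-record work in the
mapper is much more expensive than decompressing and discarding.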

I've seen that there are quite a few real life situations where I expect my
solution can be useful.

What I created is as far as I can tell the only way you can split a gzipped
file without prior knowledge about the actual file.
If you do have prior information then other directions with a similar goal
are possible:
- Analyzing the file beforehand:
- Create a specially crafted gzipped file:

Over the last year I've had review comments from Chris Douglas (until he
stopped being involved in Hadoop) and later from Luke Lu.

Now the last feedback I got from Luke is this:

> Niels, I'm ambivalent about this patch. It has clean code and
> documentation, OTOH, it has really confusing usage/semantics and
> dubious general utility that the community might not want to maintain
> as part of an official release. After having to explain many finer
> points of Hadoop to new users/developers these days, I think the
> downside of this patch might outweigh its benefits. I'm -0 on it.
> i.e., you need somebody else to +1 on this.

So after consulting Eli I'm asking this group.

My views on this feature:
- I think this feature should go in because I think others can benefit from
it.
- I also think that it should remain disabled by default. It can then be
used by those that read the documentation.
- The implementation does not contain any decompression code at all. It
only does the splitting smartness. (It could even be refactored to make any
codec splittable). It has been tested with both the java and the native

What do you think?

Is this a feature that should go in the official release or not?

Best regards

Niels Basjes

