hadoop-common-dev mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Making Gzip splittable
Date Mon, 20 Feb 2012 00:23:37 GMT

As some of you know, I've created a patch that effectively makes Gzip
splittable.


For a split that starts somewhere in the middle of the file, the reader
decompresses from the start of the file up to the point where the split
starts and ignores that output. This wastes resources, but the waste is
useful when it creates room to run a heavy-lifting mapper in parallel.
Due to this balance between the waste being useful and the waste being
wasteful I've included extensive documentation in the patch on how it works
and how to use it.
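
To illustrate the idea for readers of the archive, here is a minimal Python
sketch. It is not the code from the patch. The function name `read_split`,
the `chunk` parameter and the rule of attributing output to the compressed
byte range that produced it are assumptions made for this example, and the
record-boundary handling a real Hadoop record reader needs is left out.

```python
import gzip
import zlib


def read_split(compressed: bytes, start: int, end: int, chunk: int = 64) -> bytes:
    """Return the decompressed bytes produced by compressed range [start, end).

    Hypothetical helper: everything before `start` is still decompressed,
    because a gzip stream cannot be entered in the middle, but that output
    is thrown away.
    """
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
    out = bytearray()
    pos = 0
    while pos < end and pos < len(compressed):
        # Never let one chunk straddle the split start or the split end.
        limit = start if pos < start else end
        piece = compressed[pos:min(pos + chunk, limit)]
        data = decoder.decompress(piece)
        if pos >= start:
            out += data  # inside the split: keep the output
        # before the split: the output is the "useful waste" and is dropped
        pos += len(piece)
    if end >= len(compressed):
        out += decoder.flush()  # last split: collect anything still buffered
    return bytes(out)


# Demo data: three splits over one gzipped file.
original = b"".join(b"line %d\n" % i for i in range(2000))
compressed = gzip.compress(original)
n = len(compressed)
bounds = [0, n // 3, 2 * n // 3, n]
parts = [read_split(compressed, a, b) for a, b in zip(bounds, bounds[1:])]
assert b"".join(parts) == original
```

Under these assumptions, with N equal splits the k-th split decompresses
roughly k/N of the file, so all splits together decompress about (N + 1) / 2
times the file. That is the price paid for letting the N mappers run in
parallel.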

I've seen that there are quite a few real life situations where I expect my
solution can be useful.

What I created is as far as I can tell the only way you can split a gzipped
file without prior knowledge about the actual file.
If you do have prior information then other directions with a similar goal
are possible:
- Analyzing the file beforehand:
- Create a specially crafted gzipped file:

Over the last year I've had review comments from Chris Douglas (until he
stopped being involved in Hadoop) and later from Luke Lu.

Now the last feedback I got from Luke is this:

> Niels, I'm ambivalent about this patch. It has clean code and
> documentation, OTOH, it has really confusing usage/semantics and
> dubious general utility that the community might not want to maintain
> as part of an official release. After having to explain many finer
> points of Hadoop to new users/developers these days, I think the
> downside of this patch might outweigh its benefits. I'm -0 on it.
> i.e., you need somebody else to +1 on this.

So after consulting Eli I'm asking this group.

My views on this feature:
- I think this feature should go in because I think others can benefit from
it.
- I also think that it should remain disabled by default. It can then be
used by those that read the documentation.
- The implementation does not contain any decompression code at all. It
only does the splitting smartness. (It could even be refactored to make any
codec splittable). It has been tested with both the Java and the native

What do you think?

Is this a feature that should go in the official release or not?

Best regards

Niels Basjes
