hadoop-common-issues mailing list archives

From "Niels Basjes (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7909) Implement Splittable Gzip based on a signature in a gzip header field
Date Sun, 11 Dec 2011 22:32:40 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167229#comment-13167229

Niels Basjes commented on HADOOP-7909:

Let me elaborate on my thoughts to help you construct a sound design.

Assume you call it .gz --> then your setup must be capable of handling non-splittable
gzip files, splittable gzip files, and files where the chunk size is way off (say 100MB instead
of 64K).

A big part of the problem lies in the way the FileInputSplits are created: *_Without reading
the actual input file._* 
Have a look at TextInputFormat and its superclass FileInputFormat (both in hadoop-mapreduce-client-core).
The only check that is done is whether the codec (selected by filename extension) implements SplittableCompressionCodec.
The splits are then created by looking at some config settings ... not the input file.
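To make that concrete, here is a rough sketch (simplified, not copied from the actual FileInputFormat code) of how split sizes come purely from configuration and the file length:

```java
// Simplified sketch of the split-size arithmetic in FileInputFormat
// (hadoop-mapreduce-client-core). Nothing here reads a single byte of
// the input file: only its length and config values are consulted.
public class SplitSizeSketch {
    // Same formula as FileInputFormat.computeSplitSize():
    // max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileLength = 1000L * 1024 * 1024;                 // a 1000MB file
        long splitSize  = computeSplitSize(128L * 1024 * 1024, // HDFS block size
                                           1,                  // min split size
                                           10L * 1024 * 1024); // max split size
        long numSplits  = (fileLength + splitSize - 1) / splitSize;
        System.out.println(splitSize + " bytes per split, " + numSplits + " splits");
    }
}
```

With a 10MB split ceiling, this 1000MB file yields 100 splits no matter where (or whether) the compressed chunks actually begin.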

Assume you call it something custom (like the .sgz I mentioned): then you still have to be
able to handle the "huge chunks" situation. So if I create a 1000MB file with 100MB chunks
and instruct the config to create 10MB splits ... there will be far too many map tasks (running
independently on different nodes).

Or ... you actually _define_ both the extension and the chunk size as requirements.
To me that would be a new file format based on gzip.

Just my 2 cents ...

> Implement Splittable Gzip based on a signature in a gzip header field
> ---------------------------------------------------------------------
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
> I propose to take the suggestion of PIG-42 and extend it to
>  - add a more robust header such that false matches are vanishingly unlikely
>  - repeat initial bytes of the header for very fast split searching
>  - break down the stream into modest size chunks (~64k?) for rapid parallel encode and decode
>  - provide length information on the blocks in advance to make block decode possible
in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where each
letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this record
> <compressedRecordLength> := 32-bit length of this record as compressed, including
all headers, trailers
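As a hedged sketch, the proposed <sh> layout could be serialized as an RFC 1952 FEXTRA subfield along these lines; note that the 'S','Z' subfield ID bytes are a placeholder of my choosing, not part of the proposal:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of serializing the proposed split header as a gzip (RFC 1952)
// extra subfield: 4 bytes of subfield header (SI1, SI2, LEN) plus the
// 32-byte payload <version><signature><uncompressedDataLength>
// <compressedRecordLength>. The 'S','Z' subfield ID is a placeholder.
public class SplitHeaderSketch {
    static byte[] buildExtraField(byte[] signature23, int uncompLen, int compLen) {
        if (signature23.length != 23)
            throw new IllegalArgumentException("signature must be 23 bytes");
        ByteBuffer buf = ByteBuffer.allocate(36).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 'S').put((byte) 'Z'); // SI1, SI2: subfield ID (placeholder)
        buf.putShort((short) 32);            // LEN: payload length, little-endian
        buf.put((byte) 1);                   // <version>
        buf.put(signature23);                // <signature>, 23 random bytes
        buf.putInt(uncompLen);               // <uncompressedDataLength>
        buf.putInt(compLen);                 // <compressedRecordLength>
        return buf.array();                  // 36 bytes total, as in the proposal
    }
}
```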
> If multiple extra headers are present and the split header is not the first header, the
initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as BlockCompressorStream
does. Non-split-aware decoders will ignore this header and decode the appended blocks without
ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of Moore's
law before collisions become a significant concern.
> The first 7 bytes are repeated for speed. When splitting, the signature search will look
for the 32-bit value aaaa every 4 bytes until a hit is found, then the next 4 bytes identify
the alignment of the header mod 4 to identify a potential header match, then the whole header
is validated at that offset. So, there is a load, compare, branch, and increment per 4 bytes
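A minimal, purely illustrative version of that search loop (the four byte compares below stand in for the single 32-bit load-and-compare the description envisions) might look like:

```java
// Illustrative version of the described search: examine one aligned
// 4-byte window per iteration for the repeated signature byte, and only
// on a hit probe the few possible start offsets for the full 23-byte
// signature. Not the patch's code; just the scanning idea.
public class SignatureScanSketch {
    static int findSignature(byte[] data, byte[] sig) {
        byte a = sig[0]; // the signature begins with 7 repeats of this byte
        for (int i = 0; i + 4 <= data.length; i += 4) {
            // one load, compare, branch, and increment per 4 bytes
            if (data[i] == a && data[i + 1] == a && data[i + 2] == a && data[i + 3] == a) {
                // 7 repeated bytes guarantee the signature start lies in [i-3, i]
                for (int s = Math.max(0, i - 3); s <= i; s++) {
                    if (matchesAt(data, s, sig)) return s;
                }
            }
        }
        return -1; // no split header found in this range
    }

    static boolean matchesAt(byte[] data, int off, byte[] sig) {
        if (off + sig.length > data.length) return false;
        for (int j = 0; j < sig.length; j++) {
            if (data[off + j] != sig[j]) return false;
        }
        return true;
    }
}
```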
> The existing gzip implementations do not provide access to the optional header fields
(nor comment nor filename), so the entire gzip header will have to be reimplemented, and compression
will need to be done using the raw deflate options of the native library / built-in deflater.
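For what it's worth, java.util.zip already exposes the raw deflate mode mentioned here via the nowrap flag, so a sketch of the per-block round trip (with the hand-written gzip headers elided) could be:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of the raw-deflate route: Deflater/Inflater with nowrap=true
// produce and consume bare deflate data, leaving the reimplemented gzip
// header (with the extra field) to be written separately around each
// block. Using a fresh Deflater per block also gives the per-block
// history reset the proposal describes. Illustration, not the patch.
public class RawDeflateSketch {
    static byte[] roundTrip(byte[] input) throws DataFormatException {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        def.setInput(input);
        def.finish();
        byte[] comp = new byte[input.length + 64]; // slack for incompressible input
        int clen = def.deflate(comp);
        def.end();

        Inflater inf = new Inflater(true); // raw inflate, no gzip/zlib wrapper
        inf.setInput(comp, 0, clen);
        byte[] out = new byte[input.length];
        inf.inflate(out);
        inf.end();
        return out;
    }
}
```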
> There will be some degradation when using splittable gzip:
>  - The gzip headers will each be 36 bytes larger. (4 byte extra header header, 32 byte
extra header)
>  - There will be one gzip header per block.
>  - History will have to be reset with each block to allow starting from scratch at that
offset resulting in some uncompressed bytes that would otherwise have been strings.
> Issues to consider:
>  - Is the searching fast enough without the repeating 7 bytes in the signature?
>  - Should this be a patch to the existing gzip classes to add a switch, or should this
be a whole new class?
>  - Which level does this belong at? CompressionStream? Compressor?
>  - Is it more advantageous to encode the signature into the less dense comment field?
>  - Optimum block size? Smaller chunks split faster and may conserve memory; larger ones give
a slightly better compression ratio.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

