hadoop-common-issues mailing list archives

From "churro morales (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Thu, 06 Oct 2016 02:01:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550599#comment-15550599

churro morales commented on HADOOP-13578:

@jlowe thank you for the thorough review.  The reason the zstd CLI and Hadoop can't read
each other's compressed data is that ZStandardCodec uses the Block(Compressor|Decompressor)
streams.  I assumed this library would be used to compress large amounts of data, so with
these streams each block gets a header followed by its compressed data.  The 8 bytes you are
referring to are two ints: the sizes of the uncompressed and compressed block.  If you strip
these headers, the CLI can read the zstd blocks, and if you compress a file with the zstd
CLI and prepend the size headers, it will work in Hadoop.
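To make the framing concrete, here is a minimal, self-contained sketch of the kind of per-block header described above: two 4-byte ints (uncompressed and compressed sizes) in front of each payload. Identity "compression" stands in for the real zstd call, and the exact layout Hadoop's BlockCompressorStream writes may differ in detail; this only illustrates where the extra 8 bytes come from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Sketch of the per-block framing discussed above: each block carries the
// uncompressed and compressed sizes as two 4-byte ints, then the payload.
// Identity "compression" stands in for the real zstd call.
public class BlockFraming {
    static byte[] frame(byte[] uncompressed) throws IOException {
        byte[] compressed = uncompressed;           // placeholder for zstd compress
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(uncompressed.length);          // header int #1
        dos.writeInt(compressed.length);            // header int #2 (the "8 bytes")
        dos.write(compressed);
        return out.toByteArray();
    }

    static byte[] unframe(byte[] framed) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(framed));
        int uncompressedLen = dis.readInt();        // read and discard headers
        int compressedLen = dis.readInt();
        byte[] compressed = new byte[compressedLen];
        dis.readFully(compressed);
        return compressed;                          // placeholder for zstd decompress
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello zstd".getBytes("UTF-8");
        byte[] framed = frame(data);
        System.out.println(framed.length == data.length + 8); // 8-byte header
        System.out.println(Arrays.equals(unframe(framed), data));
    }
}
```

A plain zstd CLI reader would choke on those leading 8 bytes, which is exactly the incompatibility described above.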

The snappy compressor / decompressor works the same way.  I do not believe you can compress
in snappy format with Hadoop, transfer the file locally, and call Snappy.uncompress()
without first removing the headers.

If we do not want block-level compression, that is fine.  Otherwise we can add a utility
in Hadoop to handle the block headers, as was done for hadoop-snappy, or as some of the
snappy CLI tools like snzip do.
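Such a header-stripping utility could be sketched roughly as follows. This is a hypothetical helper, not the hadoop-snappy or snzip implementation: it walks a stream of [uncompressedLen][compressedLen][payload] blocks and emits only the payloads, leaving raw codec frames a CLI tool could consume directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a header-stripping pass over a stream of
// [uncompressedLen int][compressedLen int][payload] blocks: discard the
// two size ints and concatenate the payloads.
public class StripHeaders {
    static byte[] strip(byte[] framed) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.available() >= 8) {
            in.readInt();                        // uncompressed size (discard)
            int compressedLen = in.readInt();    // compressed size
            byte[] payload = new byte[compressedLen];
            in.readFully(payload);
            out.write(payload);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a two-block framed stream, then strip it back down.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(buf);
        for (String s : new String[] {"block one ", "block two"}) {
            byte[] b = s.getBytes("UTF-8");
            dos.writeInt(b.length);   // pretend uncompressed == compressed
            dos.writeInt(b.length);
            dos.write(b);
        }
        System.out.println(new String(strip(buf.toByteArray()), "UTF-8"));
    }
}
```

A real tool would of course also run the codec itself on each payload rather than just concatenating raw bytes.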

As for the decompressed bytes, I agree.  I will check that the size returned by the
function that reports how many bytes are needed to decompress the buffer is not larger
than our buffer size.  I can also add the isError and getErrorName checks to the
decompression code.  The reason I explicitly checked whether the expected size equaled
the desired size is that the error zstd provided was too vague, but I'll add the checks
in case there are other errors.
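The bounds check described above could look roughly like this. All names here are hypothetical; the real check would live in the JNI glue and consult zstd's own error helpers (isError / getErrorName) rather than a sentinel value.

```java
// Hypothetical sketch of the bounds check discussed above: before
// decompressing, verify that the size the library reports for the frame
// does not exceed the destination buffer, and fail with a descriptive
// message instead of a vague codec error.
public class SizeCheck {
    static void checkDecompressedSize(long reportedSize, int bufferCapacity) {
        if (reportedSize < 0) {
            // stands in for zstd's isError/getErrorName reporting
            throw new IllegalStateException("zstd could not determine frame size");
        }
        if (reportedSize > bufferCapacity) {
            throw new IllegalArgumentException(
                "decompressed size " + reportedSize
                + " exceeds buffer capacity " + bufferCapacity);
        }
    }

    public static void main(String[] args) {
        checkDecompressedSize(4096, 65536);   // fits: no exception
        try {
            checkDecompressedSize(1 << 20, 65536);
            System.out.println("missed overflow");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing early with the actual sizes in the message addresses the vagueness of the raw zstd error mentioned above.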

Yes, I will look at HADOOP-13684.  The build for this codec was modeled closely on
snappy's: since the license is BSD, we can package it in the same way as snappy.

I can also take care of the nits you described.

Are we okay with the compression being at the block level?  If so, this implementation
will work just like the other block compression codecs: it will add / require the header
for the Hadoop blocks.

Thanks again for the review.  

> Add Codec for ZStandard Compression
> -----------------------------------
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch
> ZStandard: https://github.com/facebook/zstd has been used in production at Facebook
> for 6 months now.  v1.0 was recently released.  Create a codec for this library.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
