hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Wed, 05 Oct 2016 15:02:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549007#comment-15549007 ]

Jason Lowe commented on HADOOP-13578:
-------------------------------------

Thanks for updating the patch!

bq. what do you think about adding a test that would go through all the codecs and run the
M/R job you used as your example with the MRMiniCluster

I think this would be overkill.  Those tests would not be fast to execute, especially across
all codecs.  Also, they won't run during precommit builds for changes that only touch the codecs,
since they would necessarily have to live in the mapreduce project to pick up the MR cluster and
MR job support code.  If this were simply a case of returning the wrong buffer length then I'm
wondering why the basic sanity tests of the codec don't catch it.  Those tests should be
updated to catch this without needing to run the full cluster end-to-end scenario.
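
To be concrete, a codec-level round trip along the lines of the sketch below should be enough to
catch a wrong buffer length without any MR cluster involved.  This is only a sketch: it assumes
the patch's codec class is org.apache.hadoop.io.compress.ZStandardCodec, and it simply pushes data
through the codec's stream factories and checks that the bytes survive.
{noformat}
import static org.junit.Assert.assertArrayEquals;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.ZStandardCodec;
import org.apache.hadoop.util.ReflectionUtils;
import org.junit.Test;

public class TestZStandardCodecRoundTrip {
  @Test
  public void testStreamRoundTrip() throws Exception {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(ZStandardCodec.class, conf);

    // Repetitive input so the compressor actually shrinks it and the
    // decompressor has to track buffer lengths across multiple reads.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++) {
      sb.append("the quick brown fox jumps over the lazy dog ").append(i).append('\n');
    }
    byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);

    // Compress through the codec's output stream.
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (CompressionOutputStream out = codec.createOutputStream(compressed)) {
      out.write(original);
    }

    // Decompress back through the codec's input stream.
    ByteArrayOutputStream decompressed = new ByteArrayOutputStream();
    try (CompressionInputStream in = codec.createInputStream(
        new ByteArrayInputStream(compressed.toByteArray()))) {
      IOUtils.copyBytes(in, decompressed, 4096, false);
    }

    assertArrayEquals(original, decompressed.toByteArray());
  }
}
{noformat}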

Most of the tests above now pass, but the codec is still not able to produce output that the
zstd CLI utility can understand, nor can the codec parse files that were created by the zstd
CLI utility.  For example, trying to decode a file created by the zstd CLI utility blows up
like this:
{noformat}
$ zstd < /etc/services | hadoop fs -put - services.zst
$ hadoop fs -text services.zst
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
2016-10-05 13:21:36,236 INFO  [main] compress.CodecPool (CodecPool.java:getDecompressor(181))
- Got brand-new decompressor [.zst]
-text: Fatal internal error
java.lang.ArrayIndexOutOfBoundsException
	at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.setInput(ZStandardDecompressor.java:109)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:91)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:65)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:126)
	at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:105)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:100)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:321)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:293)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:275)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:259)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:166)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:326)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:384)
{noformat}

Similarly, the zstd CLI utility just complains that any output produced by the codec is not
in zstd format, e.g.:
{noformat}
$ hadoop fs -cat wcout-zstandard/part-r-00000.zst | zstd -d
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
zstd: stdin: not in zstd format 
{noformat}

Looks like this codec is tacking on an extra 8 bytes: when I strip those off, the zstd CLI
utility is able to decode the output.  Similarly, when the codec tries to read this Hadoop-specific
header from a file created by the zstd CLI utility, it blows up because the header is not there.
The GzipCodec works without this Hadoop-specific header, so we can model this codec after it to
make it compatible with the standard .zst file format.
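
As a rough sketch of what modeling on GzipCodec means here, the stream factories could hand back
the plain CompressorStream/DecompressorStream wrappers instead of the block variants, so the bytes
on disk are just a bare zstd frame.  This assumes the patch's compressor/decompressor types, and
the fixed buffer size stands in for whatever the codec actually reads from the Configuration:
{noformat}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.CompressorStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.DecompressorStream;

// Sketch only: the real codec would override createOutputStream/createInputStream
// in ZStandardCodec itself and size the buffers from the Configuration.
public class ZStandardStreamFactorySketch {
  private static final int BUFFER_SIZE = 64 * 1024; // illustrative stand-in

  public static CompressionOutputStream createOutputStream(
      OutputStream out, Compressor compressor) throws IOException {
    // CompressorStream writes only what the compressor emits, with no
    // Hadoop-specific length words in front of each block.
    return new CompressorStream(out, compressor, BUFFER_SIZE);
  }

  public static CompressionInputStream createInputStream(
      InputStream in, Decompressor decompressor) throws IOException {
    // DecompressorStream expects a bare compressed stream rather than the
    // block framing that BlockDecompressorStream reads.
    return new DecompressorStream(in, decompressor, BUFFER_SIZE);
  }
}
{noformat}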

The code still compiles with warnings:
{noformat}
[INFO] Running make -j 2 VERBOSE=1
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:
In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:82:
warning: unused variable ‘uncompressed_direct_buf_len’
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:
In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:82:
warning: unused variable ‘uncompressed_direct_buf_len’
{noformat}

Those warnings are a red flag.  The code is not checking the amount of space available
in the uncompressed bytes buffer.  If I'm reading the code correctly, it's asking the zstd
library how many bytes would be needed to fully decompress the data and passing that as the
output buffer size rather than the actual buffer size.  That's very bad, since corrupted or
malicious input could drive the codec into a buffer overflow and corrupt the JVM itself.
Seems like we either need to call the zstd library with the actual output buffer size and hope
the data from the compressed frame fits (and the user needs to bump the output buffer size if
it doesn't), or we need to switch the codec to the streaming mode.

Error handling in the native decompressor code should be much more informative to the user.
Currently it just complains about a size mismatch, but it should use ZSTD_isError
and ZSTD_getErrorName like the compressor native code does.  Then, if the output buffer size is
too small, the user would see that error message and hopefully discern what to do.

HADOOP-13684 looks relevant here, since this codec was heavily copied from the Snappy one.  Looks
like the init code error handling needs to differentiate between the native library not being
loaded at all and the native library not having ZStandard support included.
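
For instance, an init check shaped like SnappyCodec.checkNativeCodeLoaded() would separate the two
failure modes; buildSupportsZstd() below is a hypothetical probe the patch would need to add to
NativeCodeLoader, mirroring the existing buildSupportsSnappy():
{noformat}
import org.apache.hadoop.util.NativeCodeLoader;

public class ZStandardLoadCheckSketch {
  /**
   * Distinguish "libhadoop was never loaded" from "libhadoop was loaded
   * but built without zstd support" so the user sees the right error.
   */
  public static void checkNativeCodeLoaded() {
    if (!NativeCodeLoader.isNativeCodeLoaded()) {
      throw new RuntimeException(
          "native hadoop library not loaded, cannot use the zstd codec");
    }
    // Hypothetical native probe, analogous to NativeCodeLoader.buildSupportsSnappy().
    if (!NativeCodeLoader.buildSupportsZstd()) {
      throw new RuntimeException(
          "native hadoop library was built without zstd support");
    }
  }
}
{noformat}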

Nits:
- clazz has a suppress warnings for unchecked that looks like it should be rawtypes instead.
- compressionLevel has a suppress warnings for unchecked that is unnecessary
- "compressoin" s/b "compression"
- There's no indentation in this block:
{noformat}
    if (uncompressed_bytes == 0) {
    return (jint)0;
    }
{noformat}
- The indentation in general for the native code is very inconsistent.  The decompressor file
alone uses indents varying across 0, 1, 2, 4, and 6 spaces.


> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production for 6 months by facebook now.  v1.0 was recently released.  Create a codec for this library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


