hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13578) Add Codec for ZStandard Compression
Date Mon, 03 Oct 2016 14:42:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542571#comment-15542571 ]

Jason Lowe commented on HADOOP-13578:
-------------------------------------

Sorry for the delay in getting a more detailed review.  Before delving deeper into the code,
I ran the codec through some basic tests and found a number of problems.

The native code compiles with warnings that should be cleaned up:
{noformat}
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c: In function ‘Java_org_apache_hadoop_io_compress_zstd_ZStandardDecompressor_decompressBytes’:
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
[WARNING] /hadoop/y-src/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardDecompressor.c:110: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
{noformat}
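
For what it's worth, the warning itself is straightforward to fix.  A minimal sketch of the
pattern gcc is flagging (illustrative only, not the actual codec source):
{noformat}
#include <stdio.h>

int main(void) {
    size_t remaining = 42;

    /* printf("read %d bytes\n", remaining);   <-- mismatched: %d expects int */
    printf("read %zu bytes\n", remaining);                   /* C99 %zu matches size_t */
    printf("read %lu bytes\n", (unsigned long) remaining);   /* portable pre-C99 cast  */
    return 0;
}
{noformat}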

The codec does not work as an intermediate codec for MapReduce jobs.  Running a wordcount
job with -Dmapreduce.map.output.compress=true
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec works, but
specifying -Dmapreduce.map.output.compress=true
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec causes
the reducers to fail while fetching map outputs, complaining about a premature EOF:
{noformat}
2016-10-03 13:51:32,140 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 about to shuffle output of map attempt_1475501532481_0007_m_000000_0 decomp: 323113 len: 93339 to MEMORY
2016-10-03 13:51:32,149 WARN [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle for fetcher#5
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:209)
	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.doShuffle(InMemoryMapOutput.java:90)
	at org.apache.hadoop.mapreduce.task.reduce.IFileWrappedMapOutput.shuffle(IFileWrappedMapOutput.java:63)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:536)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
{noformat}
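
A premature EOF on the read side usually means the write side never finished its frame.
For comparison, here is a rough sketch of how libzstd's v1.0 streaming API expects a frame
to be completed (my own illustration under that assumption, not code from the patch): the
frame is only complete once ZSTD_endStream() has been drained until it returns 0.
{noformat}
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>   /* libzstd >= 1.0 streaming API; error handling mostly elided */

/* Compress src into 'out', finishing the frame properly. */
static int compress_all(const void *src, size_t srcSize, FILE *out) {
    ZSTD_CStream *zcs = ZSTD_createCStream();
    size_t const outSize = ZSTD_CStreamOutSize();
    void *outBuf = malloc(outSize);
    if (!zcs || !outBuf) return -1;
    ZSTD_initCStream(zcs, 3 /* compression level */);

    ZSTD_inBuffer in = { src, srcSize, 0 };
    while (in.pos < in.size) {
        ZSTD_outBuffer ob = { outBuf, outSize, 0 };
        size_t const rc = ZSTD_compressStream(zcs, &ob, &in);
        if (ZSTD_isError(rc)) return -1;
        fwrite(outBuf, 1, ob.pos, out);
    }

    /* The frame is not complete until ZSTD_endStream() reports 0 bytes
     * remaining; stopping after the last compressStream() call leaves a
     * truncated frame, and a reader hits EOF before the frame ends. */
    size_t remaining;
    do {
        ZSTD_outBuffer ob = { outBuf, outSize, 0 };
        remaining = ZSTD_endStream(zcs, &ob);
        if (ZSTD_isError(remaining)) return -1;
        fwrite(outBuf, 1, ob.pos, out);
    } while (remaining != 0);

    free(outBuf);
    ZSTD_freeCStream(zcs);
    return 0;
}
{noformat}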

The codec also has some issues with MapReduce jobs when reading input from a previous job's
output that has been zstd compressed.  For example, this sequence of steps generates the
output one would expect, where we're effectively word counting the output of wordcount on
/etc/services (just some sample input for wordcount):
{noformat}
$ hadoop fs -put /etc/services wcin
$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount \
    -Dmapreduce.map.output.compress=true \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    wcin wcout-gzip
$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount \
    wcout-gzip wcout-gzip2
{noformat}
But if we do the same with org.apache.hadoop.io.compress.ZStandardCodec, there's an odd
record consisting of about 25K of NUL bytes (i.e. 0x00) in the output of the second job.
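
A run of NULs like that suggests the decompressor is handing back more bytes than libzstd
actually produced.  For comparison, a sketch of a conservative streaming decompress loop
(again my own illustration, not the patch's code) that only trusts the byte count the
library reports via the output buffer's pos field:
{noformat}
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Decompress a zstd stream from 'in' to 'out'. */
static int decompress_all(FILE *in, FILE *out) {
    ZSTD_DStream *zds = ZSTD_createDStream();
    size_t const inSize = ZSTD_DStreamInSize();
    size_t const outSize = ZSTD_DStreamOutSize();
    void *inBuf = malloc(inSize), *outBuf = malloc(outSize);
    if (!zds || !inBuf || !outBuf) return -1;
    ZSTD_initDStream(zds);

    size_t readBytes;
    while ((readBytes = fread(inBuf, 1, inSize, in)) > 0) {
        ZSTD_inBuffer ib = { inBuf, readBytes, 0 };
        while (ib.pos < ib.size) {
            ZSTD_outBuffer ob = { outBuf, outSize, 0 };
            size_t const rc = ZSTD_decompressStream(zds, &ob, &ib);
            if (ZSTD_isError(rc)) return -1;
            /* Only ob.pos bytes are valid output; writing the whole
             * buffer would emit stale/zeroed bytes (e.g. runs of 0x00). */
            fwrite(outBuf, 1, ob.pos, out);
        }
    }

    free(inBuf); free(outBuf);
    ZSTD_freeDStream(zds);
    return 0;
}
{noformat}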

The output of the ZStandardCodec is not readable by the zstd CLI utility, nor is output generated
by the zstd CLI utility readable by ZStandardCodec.
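
That interoperability failure can be diagnosed quickly: every standard zstd frame starts
with the 4-byte magic number 0xFD2FB528 (stored little-endian), so inspecting the first
four bytes of the codec's output shows whether it is emitting standard frames at all.
A throwaway checker, hypothetical and for diagnosis only:
{noformat}
#include <stdio.h>

/* Report whether a file starts with the standard zstd frame magic
 * (0xFD2FB528, stored little-endian on disk: 28 B5 2F FD). */
int main(int argc, char **argv) {
    unsigned char hdr[4];
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 2; }
    if (fread(hdr, 1, 4, f) != 4) { fclose(f); return 2; }
    fclose(f);
    int ok = hdr[0] == 0x28 && hdr[1] == 0xB5 && hdr[2] == 0x2F && hdr[3] == 0xFD;
    printf("%s: %s\n", argv[1], ok ? "standard zstd frame" : "NOT a standard zstd frame");
    return ok ? 0 : 1;
}
{noformat}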

> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been in production use at Facebook
> for 6 months now.  v1.0 was recently released.  Create a codec for this library.




