hadoop-common-issues mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12990) lz4 incompatibility between OS and Hadoop
Date Fri, 01 Apr 2016 16:45:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221946#comment-15221946 ]

Harsh J commented on HADOOP-12990:
----------------------------------

I don't think our Lz4Codec implementation actually uses the FRAME specification (http://cyan4973.github.io/lz4/lz4_Frame_format.html)
when creating text-based files. It appears the codec was added for use inside block-compression
container formats such as SequenceFiles/HFiles/etc. and was never oriented towards plain
text files, or else it was introduced at a time when there was no FRAME specification for LZ4.

The lz4 utility uses the frame specification:

{code}
# cat actual-file.txt
hadoop,foo,hadoop,foo,hadoop,foo,hadoop,foo
# lz4 actual-file.txt
# lz4cat actual-file.txt.lz4
hadoop,foo,hadoop,foo,hadoop,foo,hadoop,foo
# cat actual-file.txt.lz4 | od -X
0000000 184d2204 15a74064 bf000000 6f646168
0000020 662c706f 0b2c6f6f 2c500900 0a6f6f66
0000040 00000000 cf718d62
0000050
{code}

Note the magic bytes at the head of the file, {{184d2204}}, which match the FRAME specification's magic number, as per http://cyan4973.github.io/lz4/lz4_Frame_format.html
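
For illustration, sniffing for that magic word is enough to tell framed {{lz4}} output apart from our codec's output. A minimal sketch (a hypothetical helper, not anything that exists in Hadoop):

{code}
import java.io.*;

public class Lz4FrameSniffer {

  // 0x184D2204, stored little-endian on disk per the LZ4 frame spec.
  private static final int LZ4_FRAME_MAGIC = 0x184D2204;

  /** Returns true if the file starts with the LZ4 frame magic word. */
  public static boolean isLz4Frame(File file) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
      byte[] magic = new byte[4];
      in.readFully(magic);
      // Reassemble the little-endian magic word into an int.
      int word = (magic[0] & 0xFF)
          | (magic[1] & 0xFF) << 8
          | (magic[2] & 0xFF) << 16
          | (magic[3] & 0xFF) << 24;
      return word == LZ4_FRAME_MAGIC;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(isLz4Frame(new File(args[0])));
  }
}
{code}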

In Hadoop, by contrast, we produce just the raw block-compressed form:

{code}
# cat StreamCompressor.java 
import org.apache.hadoop.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.conf.*;

// Compresses stdin to stdout using the codec class named in args[0].
public class StreamCompressor {

  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);

    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish(); // flush the compressed stream without closing System.out
  }
}
# javac -cp $(hadoop classpath) StreamCompressor.java
# java -cp $PWD:$(hadoop classpath) StreamCompressor org.apache.hadoop.io.compress.Lz4Codec < actual-file.txt > hadoop-file.txt.lz4
# cat hadoop-file.txt.lz4 | od -X
0000000 2c000000 15000000 646168bf 2c706f6f
0000020 2c6f6f66 5009000b 6f6f662c 0000000a
0000035
{code}

Note that we write none of the elements the FRAME format requires (the magic header, etc.);
we write only the compressed block directly.
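
To make that layout concrete: in the dump above, the stream begins with a 4-byte big-endian
uncompressed length ({{0000002c}} = 44, the size of actual-file.txt) and a 4-byte big-endian
compressed-chunk length ({{00000015}} = 21), followed by the raw LZ4 block. This is likely
also why {{hdfs dfs -text}} dies with an OOM on framed input (see the quoted report below):
the frame's magic and header bytes get read as enormous length fields. A minimal sketch of a
reader for this layout (a hypothetical illustration of the byte layout, not Hadoop's own
decompressor path; it assumes a small single-chunk file like the one above):

{code}
import java.io.*;

public class HadoopLz4BlockDump {

  public static void main(String[] args) throws IOException {
    try (DataInputStream in =
        new DataInputStream(new FileInputStream(args[0]))) {
      int originalLen = in.readInt(); // big-endian: bytes before compression
      int chunkLen = in.readInt();    // big-endian: size of the raw LZ4 block
      byte[] chunk = new byte[chunkLen];
      in.readFully(chunk);            // the block itself; no magic, no frame
      System.out.println("original=" + originalLen
          + " compressed=" + chunkLen);
    }
  }
}
{code}

Run against the hadoop-file.txt.lz4 produced above, it should print {{original=44 compressed=21}}.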

Therefore, fundamentally, we are not interoperable with the {{lz4}} utility. The difference
is very similar to GPLExtras' {{LzoCodec}} vs. {{LzopCodec}}: the former is just the raw
compression algorithm, while the latter is an actual framed format interoperable with the
{{lzop}} CLI utility.

To make ourselves interoperable, we'll need to introduce a new frame-wrapping codec such as
{{LZ4FrameCodec}}; users could then select it when they want to compress or decompress text
data produced by, or readable by, the {{lz4}}/{{lz4cat}} CLI utilities.
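
As a rough sketch of that direction (assuming the lz4-java library, whose
{{LZ4FrameOutputStream}}/{{LZ4FrameInputStream}} implement the frame format; the class below
and its wiring are hypothetical, and a real codec would implement {{CompressionCodec}}):

{code}
import java.io.*;
import net.jpountz.lz4.LZ4FrameInputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;

// Hypothetical sketch: a real LZ4FrameCodec would implement
// org.apache.hadoop.io.compress.CompressionCodec and return streams
// like these from createOutputStream()/createInputStream().
public class Lz4FrameStreams {

  public static OutputStream compress(OutputStream out) throws IOException {
    return new LZ4FrameOutputStream(out); // writes magic word + frame header
  }

  public static InputStream decompress(InputStream in) throws IOException {
    return new LZ4FrameInputStream(in);   // expects the frame magic word
  }

  public static void main(String[] args) throws Exception {
    try (OutputStream out = compress(new FileOutputStream("frame-file.lz4"))) {
      out.write("hadoop,foo,hadoop,foo\n".getBytes("UTF-8"));
    }
    // frame-file.lz4 should now be readable by lz4cat.
  }
}
{code}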

> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> {{hdfs dfs -text}} hit an exception when trying to view a compressed file created by the Linux lz4 tool.
> The Hadoop version in use includes HADOOP-11184 ("update lz4 to r123"), so it is using the LZ4 library at release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr  1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}


