hadoop-common-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6837) Support for LZMA compression
Date Mon, 28 Jun 2010 19:47:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883265#action_12883265 ]

Greg Roelofs commented on HADOOP-6837:
--------------------------------------

Scott Carey wrote:

bq. lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the decompression
speed of gzip).

I didn't see that in my tests.  My measurements (last column) are in terms of compressed MB/sec,
i.e., scaled by the compression ratio, but the ratios are close enough that that isn't a big
factor:

{noformat}
bzip2-1: text = 78.9% (1.1),   1.464 (0.028) ucMB/sec,   1.189 (0.037) cMB/sec
         bin  = 50.1% (3.4),   1.395 (0.021) ucMB/sec,   2.170 (0.036) cMB/sec
bzip2-9: text = 80.5% (1.0),   1.415 (0.028) ucMB/sec,   1.135 (0.037) cMB/sec
         bin  = 51.6% (3.6),   1.340 (0.020) ucMB/sec,   1.878 (0.032) cMB/sec

xz-1:    text = 79.6% (1.0),   2.705 (0.097) ucMB/sec,   1.457 (0.049) cMB/sec
         bin  = 53.3% (3.5),   1.820 (0.031) ucMB/sec,   2.93  (0.20)  cMB/sec
xz-9:    text = 82.4% (0.8),   0.240 (0.011) ucMB/sec,   1.433 (0.051) cMB/sec
         bin  = 57.2% (3.6),   0.351 (0.010) ucMB/sec,   2.73  (0.17)  cMB/sec
{noformat}

So xz/LZMA is definitely faster to decompress, but not immensely so: roughly 1.2x to 1.5x
in these measurements, not 2x to 7x.  (This was all C code.  The "text" and "bin" measurements
are averages across roughly 350 files of each type, of various sizes.  Not a perfect corpus,
but it should be varied enough to draw some reasonable conclusions.  On the other hand, the
file sizes are definitely much smaller than is typical in Hadoop jobs.)
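
As a rough cross-check of that conclusion, the decompression figures can be turned into
xz-over-bzip2 ratios.  This is only a sketch: the numbers are copied from the last column
of the table (compressed MB/sec), the dictionary and function names are made up for
illustration, and the ratios ignore the small differences in compression ratio between
the two codecs.

```python
# Decompression throughput in compressed MB/sec, copied from the last
# column of the table above. Keys are (codec-level, corpus type).
decomp_cmb_per_sec = {
    ("bzip2-1", "text"): 1.189,
    ("bzip2-1", "bin"): 2.170,
    ("bzip2-9", "text"): 1.135,
    ("bzip2-9", "bin"): 1.878,
    ("xz-1", "text"): 1.457,
    ("xz-1", "bin"): 2.93,
    ("xz-9", "text"): 1.433,
    ("xz-9", "bin"): 2.73,
}


def xz_speedup(level, kind):
    """Ratio of xz to bzip2 decompression throughput at the same level."""
    xz = decomp_cmb_per_sec[("xz-" + level, kind)]
    bz = decomp_cmb_per_sec[("bzip2-" + level, kind)]
    return xz / bz


# Hypothetical helper structure: one ratio per (level, corpus type).
ratios = {
    (level, kind): round(xz_speedup(level, kind), 2)
    for level in ("1", "9")
    for kind in ("text", "bin")
}

for key, value in sorted(ratios.items()):
    print(key, value)
```

Every ratio falls between about 1.2 and 1.5, which is well short of the 2x to 7x range
quoted above.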

Btw, I didn't see Nicholas mention it, but all of the LZMA variants he tested appear to be
stream-compatible--that is, any of the tools can decompress any of the others' streams, possibly
modulo some header-parsing.
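
To illustrate what "modulo some header-parsing" can look like in practice, here is a small
sketch using Python's standard lzma module.  It is not one of the implementations Nicholas
tested, and the two container formats shown (.xz and the legacy .lzma "alone" format) differ
in framing as well as in their headers, so treat it only as an analogy: a decoder that
inspects the header can accept either stream.

```python
import lzma

# Arbitrary sample data, chosen only for illustration.
payload = b"hadoop " * 1000

# Compress the same bytes into two different LZMA container formats.
xz_blob = lzma.compress(payload, format=lzma.FORMAT_XZ)
alone_blob = lzma.compress(payload, format=lzma.FORMAT_ALONE)

# The streams start with different headers...
headers_differ = xz_blob[:6] != alone_blob[:6]

# ...but the default FORMAT_AUTO decoder detects the container from the
# header and decodes both.
roundtrip_xz = lzma.decompress(xz_blob)
roundtrip_alone = lzma.decompress(alone_blob)

print(headers_differ, roundtrip_xz == payload, roundtrip_alone == payload)
```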

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves
> higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

