lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] Commented: (LUCENE-2810) Stored Fields Compression
Date Mon, 13 Dec 2010 15:11:02 GMT


Simon Willnauer commented on LUCENE-2810:

bq. I think Grant was looking for something that could compress across fields of different
documents (i.e. where every document represents a log record).

Right, I understand the reason for that issue, and I think it makes sense to implement something
like that on a per-codec basis, as I guess it has a whole bunch of limitations. For instance, how
do you deal with partial field loading? You really have to decompress everything unless
you have a compression scheme that works on independent blocks, something like a block cipher.
I am not sure whether something like that is around, so input is appreciated. Still, until we have
codec support for stored fields, I think we should not do it.
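
To make the partial-loading concern concrete, here is a minimal sketch of a block-wise scheme, assuming documents are grouped into fixed-size blocks that are each deflated independently with java.util.zip. The class name BlockCompressedStore, the docsPerBlock parameter, and the newline-joined record layout are hypothetical and only for illustration; none of this is an existing Lucene API. Reading one document then inflates only the block that contains it, not the whole store.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only; not part of Lucene.
// Assumes each document is a single line (e.g. a log record) without '\n'.
class BlockCompressedStore {
    private final int docsPerBlock;
    private final List<byte[]> blocks = new ArrayList<>(); // one compressed byte[] per block

    BlockCompressedStore(List<String> docs, int docsPerBlock) {
        this.docsPerBlock = docsPerBlock;
        for (int start = 0; start < docs.size(); start += docsPerBlock) {
            int end = Math.min(start + docsPerBlock, docs.size());
            // Compress each group of documents independently of the others.
            String joined = String.join("\n", docs.subList(start, end));
            blocks.add(deflate(joined.getBytes(StandardCharsets.UTF_8)));
        }
    }

    int blockCount() {
        return blocks.size();
    }

    // Inflates only the block holding docId, then picks the document out of it.
    String get(int docId) {
        byte[] raw = inflate(blocks.get(docId / docsPerBlock));
        String[] docsInBlock = new String(raw, StandardCharsets.UTF_8).split("\n", -1);
        return docsInBlock[docId % docsPerBlock];
    }

    private static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The trade-off is the block size: larger blocks let the compressor see more redundancy across documents, but every single-document read pays for inflating the whole block.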


> Stored Fields Compression
> -------------------------
>                 Key: LUCENE-2810
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for documents
> contain a lot of redundant information and end up wasting a lot of space across a large collection
> of documents.  For instance, simply compressing a typical log file often results in more than 75%
> compression.  We should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that being said, in
> most logging applications, fast retrieval of a given event is not always critical).  For instance,
> perhaps it is possible to have a part of storage that contains the set of unique values for
> all the fields, so that the document field value simply contains a reference (could be as small
> as a few bits, depending on the number of unique items) to that value instead of having a full
> copy.  Extending this, perhaps we can leverage some existing compression capabilities in Java
> to provide this as well.
> It may make sense to implement this as a Directory, but it might also make sense as a
> Codec, if and when we have support for changing storage Codecs.
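
The "set of unique values plus a small reference per document" idea from the description can be sketched roughly as follows. This is a minimal illustration under assumptions: the class DictionaryEncodedField and its methods are hypothetical names, not Lucene classes, and the references are kept in a plain int[] rather than actually bit-packed. bitsPerReference() only reports how many bits a packed reference would need.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
class DictionaryEncodedField {
    private final List<String> uniqueValues = new ArrayList<>(); // each distinct value stored once
    private final int[] ordinals;                                // per-document reference into uniqueValues

    DictionaryEncodedField(List<String> valuePerDoc) {
        Map<String, Integer> ordinalOf = new HashMap<>();
        ordinals = new int[valuePerDoc.size()];
        for (int doc = 0; doc < ordinals.length; doc++) {
            String value = valuePerDoc.get(doc);
            Integer ord = ordinalOf.get(value);
            if (ord == null) {
                // First time this value is seen: append it to the dictionary.
                ord = uniqueValues.size();
                ordinalOf.put(value, ord);
                uniqueValues.add(value);
            }
            ordinals[doc] = ord;
        }
    }

    // Resolves the document's reference back to the full value.
    String get(int docId) {
        return uniqueValues.get(ordinals[docId]);
    }

    int uniqueCount() {
        return uniqueValues.size();
    }

    // Bits needed to address every unique value if references were bit-packed.
    int bitsPerReference() {
        int n = uniqueValues.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

For a field such as a log level with three distinct values, each document would need only a 2-bit reference instead of a full copy of the string. The approach only pays off when the number of unique values is small relative to the number of documents.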

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
