lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Commented: (LUCENE-2810) Stored Fields Compression
Date Mon, 13 Dec 2010 15:59:02 GMT


Grant Ingersoll commented on LUCENE-2810:

Compression 1.0 was a different use case.  That was for compressing a single field, and I agree
it was a waste.  Where in my email did I say that users had to use it?  We have all kinds
of optional alternatives.  And you are so hung up on the word compression.  I will change the name
of this issue to something without the word in it, so you don't think this has to be some
form of gzip.  It is instead about alternate storage options that will benefit particular
use cases.

> Stored Fields Compression
> -------------------------
>                 Key: LUCENE-2810
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for documents
> contain a lot of redundant information and end up wasting a lot of space across a large
> collection of documents.  For instance, simply compressing a typical log file often results
> in > 75% compression rates.  We should explore mechanisms for applying compression across
> all the documents for a field (or fields) while still maintaining relatively fast lookup
> (that being said, in most logging applications, fast retrieval of a given event is not
> always critical).  For instance, perhaps it is possible to have a part of storage that
> contains the set of unique values for all the fields, and the document field value simply
> contains a reference (could be as small as a few bits, depending on the number of unique
> items) to that value instead of having a full copy.  Extending this, perhaps we can leverage
> some existing compression capabilities in Java to provide this as well.
> It may make sense to implement this as a Directory, but it might also make sense as a
> Codec, if and when we have support for changing storage Codecs.
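
The "set of unique values plus a small per-document reference" idea in the description above could be sketched roughly as follows. This is only an illustration under assumptions: the class name DictionaryEncodedField and its methods are hypothetical, nothing here is Lucene API, and everything is held in memory rather than written through a Directory or Codec.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not Lucene API): stores each distinct field value once
 * and keeps only a small integer reference per document.
 */
public class DictionaryEncodedField {
    // Maps a value to its ordinal in uniqueValues.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // The set of unique values, stored once each.
    private final List<String> uniqueValues = new ArrayList<>();
    // One reference per document, indexed by document id.
    private final List<Integer> docToOrdinal = new ArrayList<>();

    /** Records the field value for the next document and returns its id. */
    public int addDocument(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            ord = uniqueValues.size();
            ordinals.put(value, ord);
            uniqueValues.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Looks up the stored value for a document via its reference. */
    public String get(int docId) {
        return uniqueValues.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values seen so far. */
    public int uniqueCount() {
        return uniqueValues.size();
    }

    /**
     * Bits a packed reference would need: ceil(log2(uniqueCount)), at least 1
     * once any value exists. This sketch does not actually pack the bits.
     */
    public int bitsPerReference() {
        int n = uniqueValues.size();
        if (n == 0) {
            return 0;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(n - 1));
    }
}
```

With, say, 1,000 distinct log messages, each reference would need 10 bits instead of a full copy of the message. Lookup stays a constant-time indirection, which fits the "relatively fast lookup" goal, though a real implementation would still have to decide how the dictionary is persisted and merged across segments.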

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.