lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2810) Stored Fields Compression
Date Mon, 13 Dec 2010 15:57:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970860#action_12970860 ]

Shai Erera commented on LUCENE-2810:
------------------------------------

bq. in any event, its useless to add any compression that doesn't beat what filesystems can
already do on average.

I'm not sure it's *useless*. Consider an application like Google Desktop Search built
on top of Lucene. You cannot force users to compress the installation folder, yet it would
still be valuable for Lucene to compress data on its own, especially the data it chooses to
store. Such applications are special in that they run as a service on the user's machine,
outside the control of whoever developed them. I therefore find myself tuning my
Lucene-based app as much as I can, and I often don't rely on users enabling certain OS
features (and who knows whether those features will still be there one day?).

Today I handle compressed fields with Lucene's CompressionTools, and I'm generally happy
with it. If, however, a compressed store could improve my application's performance by
compressing the stored fields differently (achieving a better compression ratio, for
example), it might be useful, especially if integrating it were a no-brainer.
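
To make the per-field approach concrete, here is a minimal sketch of the kind of round trip
I mean. As far as I know CompressionTools is built on java.util.zip, so the sketch uses
Deflater/Inflater directly and runs without Lucene on the classpath. The class and method
names (FieldCompressionSketch, compress, decompress) are made up for illustration and are
not Lucene API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of Lucene.
final class FieldCompressionSketch {

    /** Deflates the bytes of one stored field value at the given level. */
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(32, input.length / 2));
            byte[] buffer = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Inflates a value produced by compress(); rejects truncated input. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 2);
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // A repetitive, log-like value, similar to the case the issue describes.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("2010-12-13 15:57:01 INFO request served\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original, Deflater.BEST_COMPRESSION);
        byte[] restored = decompress(packed);
        System.out.println(original.length + " bytes -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

The application would store the returned bytes as a binary stored field and call decompress
when the document is loaded, so the decompression cost is paid only for the fields that
were compressed in the first place.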

I think, though, that we'd want to differentiate between fields: not all of them should be
compressed, because compressed fields have to be decompressed on retrieval, which might be
expensive for some apps.

> Stored Fields Compression
> -------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for documents
> contain a lot of redundant information and end up wasting a lot of space across a large collection
> of documents.  For instance, simply compressing a typical log file often results in > 75%
> compression rates.  We should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that being said, in
> most logging applications, fast retrieval of a given event is not always critical.)  For instance,
> perhaps it is possible to have a part of storage that contains the set of unique values for
> all the fields and the document field value simply contains a reference (could be as small
> as a few bits depending on the number of uniq. items) to that value instead of having a full
> copy.  Extending this, perhaps we can leverage some existing compression capabilities in Java
> to provide this as well.
> It may make sense to implement this as a Directory, but it might also make sense as a
> Codec, if and when we have support for changing storage Codecs.
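
The unique-values idea in the description above can be sketched roughly as follows. This is
only an in-memory illustration under assumed names (ValueDictionarySketch, encode, decode,
bitsPerReference); none of them exist in Lucene, and nothing here says how such a
dictionary would be laid out on disk in a Directory or Codec.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a shared value dictionary; not Lucene API.
final class ValueDictionarySketch {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the reference for a value, adding it to the dictionary on first sight. */
    int encode(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    /** Resolves a reference back to the single stored copy of the value. */
    String decode(int id) {
        return valuesById.get(id);
    }

    int uniqueValues() {
        return valuesById.size();
    }

    /** Bits needed per document reference for the current number of unique values. */
    int bitsPerReference() {
        int n = valuesById.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        ValueDictionarySketch levels = new ValueDictionarySketch();
        String[] perDocument = {"INFO", "WARN", "INFO", "ERROR", "INFO"};
        int[] refs = new int[perDocument.length];
        for (int i = 0; i < perDocument.length; i++) {
            refs[i] = levels.encode(perDocument[i]); // each document keeps only a small reference
        }
        System.out.println("unique values: " + levels.uniqueValues());
        System.out.println("bits per reference: " + levels.bitsPerReference());
        System.out.println("doc 3 value: " + levels.decode(refs[3]));
    }
}
```

With three distinct values, each document needs only a two-bit reference instead of a full
copy of the string, which is the saving the description is pointing at.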

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.