lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5914) More options for stored fields compression
Date Mon, 01 Dec 2014 11:53:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229698#comment-14229698 ]

Adrien Grand commented on LUCENE-5914:
--------------------------------------

bq. It's also a bit weird to be slurping in 3 read-once metadata files here. This adds complexity
at read... the current format is simpler here with just a single file. Can we avoid this?

I looked into it, but it's not easy. Lucene41 has its own custom stored fields index,
which does mostly the same thing as MonotonicBlockPackedReader, so for this new codec I wanted
to move the index to MonotonicBlockPackedReader.

The index for stored fields basically stores two pieces of information for each block: its
first doc ID and its start pointer. In Lucene41, the blocks of both were interleaved, but
MonotonicBlockPackedWriter does not allow for that, which is why there are 2 files: one for
doc IDs and one for start pointers. The second limitation is that, at read time, you need to
know up-front how many values the MonotonicBlockPackedReader stores in order to be able to
decode it. This is why we have a 3rd file for metadata, which stores the number of documents
in the segment upon call to StoredFieldsWriter.finish.
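
To make the layout above concrete, here is a minimal in-memory sketch of how such an index
could be queried. It is not Lucene's code: the class name StoredFieldsIndexSketch, its
fields, and the sample numbers are hypothetical, and plain long[] arrays stand in for the
two monotonic files and the metadata file.

```java
import java.util.Arrays;

/** Hypothetical in-memory model of the stored fields index; not Lucene's actual code. */
public class StoredFieldsIndexSketch {
    private final long[] firstDocIds;   // stands in for the doc IDs file (monotonic)
    private final long[] startPointers; // stands in for the start pointers file (monotonic)
    private final int numDocs;          // stands in for the metadata file

    public StoredFieldsIndexSketch(long[] firstDocIds, long[] startPointers, int numDocs) {
        if (firstDocIds.length != startPointers.length) {
            throw new IllegalArgumentException("need one doc ID and one start pointer per block");
        }
        if (numDocs > 0 && (firstDocIds.length == 0 || firstDocIds[0] != 0)) {
            throw new IllegalArgumentException("the first block must start at doc ID 0");
        }
        this.firstDocIds = firstDocIds;
        this.startPointers = startPointers;
        this.numDocs = numDocs;
    }

    /** Returns the index of the last block whose first doc ID is <= docId. */
    public int blockIndex(int docId) {
        if (docId < 0 || docId >= numDocs) {
            throw new IndexOutOfBoundsException("docId out of range: " + docId);
        }
        int pos = Arrays.binarySearch(firstDocIds, docId);
        // When not found, binarySearch returns -(insertionPoint) - 1;
        // the containing block is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the file offset at which the block containing docId starts. */
    public long startPointer(int docId) {
        return startPointers[blockIndex(docId)];
    }

    public static void main(String[] args) {
        // Three blocks starting at docs 0, 128 and 300, in a segment of 450 docs.
        StoredFieldsIndexSketch index = new StoredFieldsIndexSketch(
            new long[] {0, 128, 300}, new long[] {0, 16384, 40000}, 450);
        System.out.println(index.startPointer(299));
    }
}
```

The two arrays are deliberately kept separate rather than interleaved, which mirrors the
two-file layout; the document count is passed in separately because nothing in the arrays
themselves says where the last block ends.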

I agree having 3 read-only files might look strange, but it's probably better than having
to re-implement specialized monotonic encoding?
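
For readers unfamiliar with monotonic encoding, the following is a rough sketch of the
general idea for a single block: model the values as a straight line and store only small
residuals. It is a simplification written for illustration, not the Lucene implementation;
the class MonotonicBlockSketch and all of its members are hypothetical, and the residuals
are kept in a plain long[] instead of being bit-packed.

```java
import java.util.Arrays;

/** Hypothetical single-block monotonic encoder; a simplification, not Lucene's code. */
public class MonotonicBlockSketch {
    final long base;        // first value of the block
    final double avgStep;   // average increment between consecutive values
    final long minDelta;    // smallest residual, subtracted so stored residuals are >= 0
    final long[] deltas;    // residuals from the expected line (left unpacked in this sketch)
    final int bitsPerValue; // bits a packed representation would need per residual

    private MonotonicBlockSketch(long base, double avgStep, long minDelta,
                                 long[] deltas, int bitsPerValue) {
        this.base = base;
        this.avgStep = avgStep;
        this.minDelta = minDelta;
        this.deltas = deltas;
        this.bitsPerValue = bitsPerValue;
    }

    /** Value predicted by the linear model for position i. */
    static long expected(long base, double avgStep, int i) {
        return base + (long) (avgStep * i);
    }

    static MonotonicBlockSketch encode(long[] values) {
        int n = values.length;
        if (n == 0) {
            throw new IllegalArgumentException("empty block");
        }
        long base = values[0];
        double avgStep = n == 1 ? 0d : (double) (values[n - 1] - base) / (n - 1);
        long[] deltas = new long[n];
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            deltas[i] = values[i] - expected(base, avgStep, i);
            min = Math.min(min, deltas[i]);
            max = Math.max(max, deltas[i]);
        }
        for (int i = 0; i < n; i++) {
            deltas[i] -= min; // shift so that every stored residual is non-negative
        }
        // 0 bits when every residual is identical.
        int bits = 64 - Long.numberOfLeadingZeros(max - min);
        return new MonotonicBlockSketch(base, avgStep, min, deltas, bits);
    }

    /** The caller has to supply the number of values to decode. */
    long[] decode(int count) {
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            out[i] = expected(base, avgStep, i) + minDelta + deltas[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] startPointers = {0, 16384, 32770, 49100, 65536};
        MonotonicBlockSketch block = encode(startPointers);
        System.out.println(block.bitsPerValue);
        System.out.println(Arrays.toString(block.decode(startPointers.length)));
    }
}
```

Note that decode takes the value count as an argument: in a packed stream without a length
prefix, that count has to come from somewhere else, which is the role the metadata file
plays in the patch.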

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I got about the same number
> of users complaining that compression was too aggressive as users complaining that it was
> too light.
> I think it is due to the fact that we have users that are doing very different things
> with Lucene. For example, if you have a small index that fits in the filesystem cache (or
> is close to fitting), then you might never pay for actual disk seeks, and in such a case the
> fact that the current stored fields format needs to over-decompress data can noticeably
> slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics,
> and in that case you have huge amounts of data for which you don't care much about stored
> fields performance. However, it is very frustrating to notice that the data that you store
> takes several times less space when you gzip it compared to your index, although Lucene
> claims to compress stored fields.
> For that reason, I think it would be nice to have some kind of options that would allow
> trading speed for compression in the default codec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


