lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Updated] (LUCENE-5914) More options for stored fields compression
Date Sun, 30 Nov 2014 06:13:13 GMT


Robert Muir updated LUCENE-5914:
    Attachment: LUCENE-5914.patch

Some updates:
* port to trunk apis (i guess this was outdated?)
* fix some javadoc bugs
* nuke lots of now-unused stuff in .compressing, only still used for term vectors
* improve float/double compression: it was not so effective and was wasteful. floats now
write 1..5 bytes and doubles 1..9 bytes.
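
For illustration, here is a minimal sketch of one way a float can end up taking 1..5 bytes. It is an assumption about the general technique, not a copy of the patch: small integral values get a single tagged byte, other positive values are written as their raw 4 bytes, and negative values get a marker byte followed by the raw 4 bytes. The class name ZFloatSketch and its encode/decode helpers are hypothetical. A double variant would presumably follow the same shape with 1..9 bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch, not the code from the patch.
final class ZFloatSketch {
  private static final int NEGATIVE_ZERO_BITS = Float.floatToIntBits(-0f);

  /** Encodes f into 1, 4 or 5 bytes (big-endian). */
  static byte[] encode(float f) {
    ByteBuffer buf = ByteBuffer.allocate(5);
    int intVal = (int) f;
    int bits = Float.floatToIntBits(f);
    if (f == intVal && intVal >= -1 && intVal <= 0x7D && bits != NEGATIVE_ZERO_BITS) {
      // small integral value in [-1..125]: one byte, high bit set, range 0x80..0xFE
      buf.put((byte) (0x80 | (1 + intVal)));
    } else if ((bits >>> 31) == 0) {
      // other positive floats: sign bit is 0, so the first byte is < 0x80
      buf.putInt(bits);
    } else {
      // other negative floats (and -0.0): marker byte 0xFF, then the raw bits
      buf.put((byte) 0xFF);
      buf.putInt(bits);
    }
    return Arrays.copyOf(buf.array(), buf.position());
  }

  /** Reverses encode(). */
  static float decode(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    int b = buf.get() & 0xFF;
    if (b == 0xFF) {
      return Float.intBitsToFloat(buf.getInt());
    }
    if ((b & 0x80) != 0) {
      return (b & 0x7F) - 1;
    }
    // first byte was the top byte of the raw bits; read the remaining three
    int bits = (b << 24) | ((buf.getShort() & 0xFFFF) << 8) | (buf.get() & 0xFF);
    return Float.intBitsToFloat(bits);
  }
}
```

With this layout, stored values that are small counts or flags cost one byte, and only negative non-integral values pay the full 5 bytes.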

We should try to do more cleanup:
* I don't like the delegator. maybe it's the best solution, but at least it should not write
its own file. I think we should revive SI.attributes (properly: so it rejects any attribute
puts on dv updates) and use that.
* The delegator shouldn't actually need to delegate the writer? If I add this code, all tests
pass:
    final StoredFieldsWriter in = format.fieldsWriter(directory, si, context);
    if (true) return in; // wrapper below is useless
    return new StoredFieldsWriter() {
This seems to be all about delegating some manual file deletion on abort()? Do we really
need to do this? If we have some bugs around IndexFileDeleter where it doesn't do the right
thing, enough to warrant such apis, then we should have tests for it. Such tests would also
show the current code deletes the wrong filename:
    // formatName is NOT the file the delegator writes
    IOUtils.deleteFilesIgnoringExceptions(directory, formatName);
But this is obsolete if we add back SI.attributes.
* The header check logic should be improved. I don't know why we need the Reader.checkHeader
method, why can't we just check it with the other files?
* We should try to use checkFooter(Input, Throwable) for better corruption messages, with
this type of logic. It does more and appends suppressed exceptions when things go wrong:
    try (ChecksumIndexInput input = ...) {
      Throwable priorE = null;
      try {
        // ... read a bunch of stuff ...
      } catch (Throwable exception) {
        // remember the failure so checkFooter can report it alongside any checksum problem
        priorE = exception;
      } finally {
        CodecUtil.checkFooter(input, priorE);
      }
    }
* Any getChildResources() should return an immutable list: that doesn't seem to always be the
case. Maybe AssertingCodec can be improved to actually test this automatically.
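
To make the last point concrete, here is a minimal sketch of what returning an immutable child list, and checking for it automatically, could look like. It does not use Lucene's classes: the Resource interface is a hypothetical stand-in for Accountable, and isImmutable is only an assumption about the kind of check an asserting wrapper might run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; Resource stands in for Lucene's Accountable.
final class ChildResourcesSketch {
  interface Resource {
    long ramBytesUsed();
    Collection<Resource> getChildResources();
  }

  static final class Leaf implements Resource {
    private final long bytes;
    Leaf(long bytes) { this.bytes = bytes; }
    public long ramBytesUsed() { return bytes; }
    public Collection<Resource> getChildResources() {
      return Collections.emptyList(); // the shared empty list rejects mutation
    }
  }

  static final class Parent implements Resource {
    private final List<Resource> children = new ArrayList<>();
    Parent(Resource... kids) { children.addAll(Arrays.asList(kids)); }
    public long ramBytesUsed() {
      long sum = 0;
      for (Resource child : children) sum += child.ramBytesUsed();
      return sum;
    }
    public Collection<Resource> getChildResources() {
      // read-only view: callers cannot add to or remove from the internal list
      return Collections.unmodifiableList(children);
    }
  }

  /**
   * Returns true if the collection rejects add(). Note that a mutable
   * collection is modified by this probe, so use it on throwaway results only.
   */
  static boolean isImmutable(Collection<Resource> c) {
    try {
      c.add(new Leaf(0));
      return false;
    } catch (UnsupportedOperationException expected) {
      return true;
    }
  }
}
```

A test wrapper could call a probe like isImmutable on every getChildResources() result and fail if it ever returns false.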

I will look more tomorrow.

> More options for stored fields compression
> ------------------------------------------
>                 Key: LUCENE-5914
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
> Since we added codec-level compression in Lucene 4.1 I think I got about the same amount
> of users complaining that compression was too aggressive and that compression was too light.
> I think it is due to the fact that we have users that are doing very different things
> with Lucene. For example if you have a small index that fits in the filesystem cache (or is
> close to), then you might never pay for actual disk seeks and in such a case the fact that
> the current stored fields format needs to over-decompress data can sensibly slow search down
> on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics,
> and in that case you have huge amounts of data for which you don't care much about stored
> fields performance. However it is very frustrating to notice that the data that you store
> takes several times less space when you gzip it compared to your index although Lucene claims
> to compress stored fields.
> For that reason, I think it would be nice to have some kind of options that would allow
> to trade speed for compression in the default codec.
