lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <>
Subject [jira] [Updated] (LUCENE-4599) Compressed term vectors
Date Wed, 19 Dec 2012 17:25:12 GMT


Adrien Grand updated LUCENE-4599:

    Attachment: LUCENE-4599.patch

New patch (still not committable) with a better compression ratio thanks to the following changes:
 * the block of data compressed by LZ4 only contains term and payload bytes (without their lengths); everything else (positions, flags, term lengths, etc.) is stored using packed ints,
 * term freqs are encoded in a pfor-like way to save space (this gave a 3x/4x decrease in the space needed to store freqs; see the pfor-like sketch after this list),
 * when all fields have the same flags (a 3-bit int that says whether positions/offsets/payloads are enabled), the flags are stored only once per distinct field,
 * when both positions and offsets are enabled, I compute average term lengths and only store the difference between the actual start offset and the expected start offset computed from the average term length and the position (see the start-offset sketch after this list),
 * for lengths, this impl stores the difference between the indexed term length and the actual length (endOffset - startOffset), with an optimization when the differences are always equal to 0 (this can happen with ASCII text and an analyzer that does not perform stemming).
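
To illustrate the pfor-like encoding of term freqs, here is a small sketch (illustrative only, not code from the patch, and ignoring the actual bit packing): most freqs fit in a few bits, and the rare larger values are stored as exceptions that get patched back in at read time.

{code:java}
// Illustrative pfor-like sketch, not the patch code: store every freq with a small
// fixed bit width and keep the overflowing high bits of rare large values aside.
import java.util.ArrayList;
import java.util.List;

class PforLikeSketch {
  static class Encoded {
    int bitsPerValue;          // width used for the regular values
    int[] lowBits;             // freqs truncated to bitsPerValue bits
    List<int[]> exceptions = new ArrayList<>(); // (index, remaining high bits) pairs
  }

  static Encoded encode(int[] freqs, int bitsPerValue) {
    Encoded e = new Encoded();
    e.bitsPerValue = bitsPerValue;
    e.lowBits = new int[freqs.length];
    int mask = (1 << bitsPerValue) - 1;
    for (int i = 0; i < freqs.length; i++) {
      e.lowBits[i] = freqs[i] & mask;
      int high = freqs[i] >>> bitsPerValue;
      if (high != 0) {
        e.exceptions.add(new int[] {i, high}); // exception, patched back at read time
      }
    }
    return e;
  }

  static int[] decode(Encoded e) {
    int[] freqs = e.lowBits.clone();
    for (int[] exc : e.exceptions) {
      freqs[exc[0]] |= exc[1] << e.bitsPerValue;
    }
    return freqs;
  }

  public static void main(String[] args) {
    int[] freqs = {1, 1, 2, 1, 3, 1, 17, 1}; // mostly small values, one outlier
    Encoded e = encode(freqs, 2);            // 2 bits per value + 1 exception
    System.out.println(java.util.Arrays.toString(decode(e)));
  }
}
{code}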
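
To make the start offset trick more concrete, here is a small sketch (illustrative only, not code from the patch; the patch may derive the average term length and the expected offset differently): the expected start offset of an occurrence is estimated from its position and the average term length, so only a small delta has to be stored with packed ints.

{code:java}
// Illustrative sketch, not the patch code. Assumes the expected start offset at a
// given position is position * avgTermLength, where avgTermLength also accounts for
// an average separator width between terms.
class StartOffsetDeltaSketch {

  /** Value actually stored: the start offset minus the offset expected from the position. */
  static int encode(int position, int startOffset, float avgTermLength) {
    int expected = (int) (position * avgTermLength);
    return startOffset - expected; // usually small -> few bits per value
  }

  /** Inverse operation, applied when reading the term vector back. */
  static int decode(int position, int delta, float avgTermLength) {
    return (int) (position * avgTermLength) + delta;
  }

  public static void main(String[] args) {
    // "the quick brown fox" -> start offsets 0, 4, 10, 16; avg term length ~4.75 with separators
    float avg = 4.75f;
    int[] positions = {0, 1, 2, 3};
    int[] startOffsets = {0, 4, 10, 16};
    for (int i = 0; i < positions.length; i++) {
      int delta = encode(positions[i], startOffsets[i], avg);
      System.out.println("pos=" + positions[i] + " startOffset=" + startOffsets[i] + " delta=" + delta);
      assert decode(positions[i], delta, avg) == startOffsets[i];
    }
  }
}
{code}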

Depending on document size, different data accounts for most of the space in a single chunk:
|| || Small docs (28 * 1K) || Large doc (1 * 750K) ||
| Total chunk size (positions and offsets enabled) | 21K | 450K |
| Term bytes | 11K (16K before compression) | 64K (84K before compression) |
| Term lengths | 2K | 8K |
| Positions | 3K | 215K |
| Offsets | 3K (4K if positions are disabled) | 150K (240K if positions are disabled) |
| Term freqs | 500 | 7K |
The rest is negligible.

 * So with small docs, most of the space is occupied by term bytes, whereas with large docs positions and offsets can easily take 80% of the chunk size.
 * Compression might not be as good as with stored fields, especially when docs are large, because terms have already been deduplicated.

Overall, the on-disk format is more compact than the Lucene40 term vectors format (positions and offsets enabled; note that the number of documents indexed is not the same for small and large docs; file sizes are in bytes):
|| || Small docs || Large docs ||
| Lucene40 tvx | 160033 | 1633 |
| Lucene40 tvd | 49971 | 232 |
| Lucene40 tvf | 11279483 | 56640734 |
| Compressing tvx | 1116 | 78 |
| Compressing tvd | 7589550 | 44633841 |

This impl is 34% smaller than the Lucene40 one on small docs (mainly thanks to compression) and 21% smaller on large docs (mainly thanks to packed ints). If you have other ideas to improve this ratio, let me know!
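
For reference, these percentages appear to follow from summing the per-file sizes in the table above:

{noformat}
Small docs: Lucene40    = 160033 + 49971 + 11279483 = 11489487
            Compressing = 1116 + 7589550            =  7590666   -> ~34% smaller
Large docs: Lucene40    = 1633 + 232 + 56640734     = 56642599
            Compressing = 78 + 44633841             = 44633919   -> ~21% smaller
{noformat}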

I still have to write more tests, clean up the patch, make reading term vectors more memory-efficient,
and implement efficient merging...
> Compressed term vectors
> -----------------------
>                 Key: LUCENE-4599
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>         Attachments: LUCENE-4599.patch, LUCENE-4599.patch
> We should have codec-compressed term vectors similarly to what we have with stored fields.

