lucene-dev mailing list archives

From "Dawid Weiss (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3206) FST package API refactoring
Date Sat, 18 Jun 2011 10:41:48 GMT


Dawid Weiss commented on LUCENE-3206:

I think I know how to compare storing byte[] of UTF8 against vint-encoded UTF32
codepoints -- I'll encode the wikipedia terms list both ways and we will see what comes out.
Theoretically they should be very, very similar (and full unicode codepoints should generate
fewer arcs) because UTF8 uses an encoding scheme with overhead similar to vint encoding...
so anything that is a single-byte sequence in UTF8 will remain a single-byte vint. A double-byte
UTF8 character will remain a double-byte vint (the last double-byte codepoint is 0x7ff=2047, whereas
the last double-byte vint is 2^14-1=16383). And so on. So for text, vint-encoded UTF32 should
never be larger than UTF8, and is sometimes smaller... The byte[] representation wins, of course,
when your "labels" are not text but arbitrary bytes -- then byte[] would be nicer.
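
To make the size argument concrete, here is a minimal sketch that counts bytes per codepoint under both schemes. It assumes the usual vint layout (7 payload bits per byte, high bit set on every byte except the last); the class and method names are made up for illustration and are not part of the FST package.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only, not part of Lucene.
public class EncodingSizes {
    /** Bytes needed for a vint, assuming 7 payload bits per byte. */
    public static int vintLength(int value) {
        int length = 1;
        // Each extra byte is needed while bits above the low 7 remain.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            length++;
        }
        return length;
    }

    /** Bytes needed to encode a single (non-surrogate) codepoint in UTF8. */
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Total bytes when every codepoint of the term is written as a vint. */
    public static int vintEncodedLength(String term) {
        return term.codePoints().map(EncodingSizes::vintLength).sum();
    }

    public static void main(String[] args) {
        // Boundary codepoints around the 1-, 2- and 3-byte limits of each scheme.
        int[] samples = {0x7F, 0x80, 0x7FF, 0x800, 0x3FFF, 0x4000, 0xFFFF, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X utf8=%d vint=%d%n", cp, utf8Length(cp), vintLength(cp));
        }
    }
}
```

The two schemes only diverge for codepoints in 0x800..0x3fff and above 0xffff, so how much is actually saved depends on the scripts present in the term list; this counts label bytes only and says nothing about the number of arcs.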

> FST package API refactoring
> ---------------------------
>                 Key: LUCENE-3206
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 3.2
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.3, 4.0
>         Attachments: LUCENE-3206.patch
> The current API is still marked @experimental, so I think there's still time to fiddle
> with it. I've been using the current API for some time and I do have some ideas for improvement.
> This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
