lucene-dev mailing list archives

From "david.w.smiley@gmail.com" <david.w.smi...@gmail.com>
Subject Re: Encoding data in terms; UTF8 concerns?
Date Sun, 11 May 2014 14:59:29 GMT
Thank you for the background info Uwe!  It turns out my encoding was fine;
I had some other bug.
-- David

On Sunday, May 11, 2014, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi David,
>
>
>
> the reason NumericUtils encodes this way is historical:
> NumericField encoding was introduced in Lucene 2.9, where all terms were
> char[], encoded in UTF-8 on the index side. Because of that, encoding each
> byte with the full 8 bits would have added considerable index-size overhead:
> terms would grow by extra bytes, because Java chars 128…255 are each
> encoded as 2 bytes in UTF-8. For this reason NumericField uses only 7
> bits.
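
To make the overhead concrete, here is a minimal sketch (not Lucene code; the class and method names are made up for illustration) that measures how many bytes UTF-8 needs for a single Java char on either side of the 7-bit boundary:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
class Utf8Overhead {
    // Number of bytes UTF-8 needs to encode a single (non-surrogate) Java char.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length((char) 0x7F)); // 1: highest 7-bit value
        System.out.println(utf8Length((char) 0x80)); // 2: first value needing the 8th bit
        System.out.println(utf8Length((char) 0xFF)); // 2
    }
}
```

Any char that uses the eighth bit doubles in size once it is written as UTF-8, which is the overhead the 7-bit encoding avoids.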
>
> Because we cannot easily change the numeric encoding (we won’t be able to
> change it ever, unless we have information about the terms in Field
> metadata on the index side), this encoding stayed alive up to now – so it’s
> all about index backwards compatibility.
>
>
>
> If you introduce a new field for spatial, you don’t need to worry
> about this. Since Lucene 4 all terms are byte[] and are sorted in binary
> order. The order of terms in the index is given by BytesRef.compareTo(),
> which is pure binary. The good thing for us: UTF-8 order for string terms
> (which is what Lucene uses) is identical to byte[] order, but it differs
> from UTF-16 order (this is why we need a crazy backwards layer to read 3.x
> indexes: terms are sorted slightly differently). We already do full 8-bit
> encoding for Collation fields (see CollationKeyAttributeFactory, which
> encodes terms with their collation key instead of UTF-8).
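
The ordering difference can be reproduced with plain JDK calls. The sketch below is an illustration, not Lucene's implementation: `compareUnsigned` is a hand-written stand-in for a pure binary comparison, and the two sample characters are arbitrary choices (a BMP character above the surrogate range and a supplementary character).

```java
import java.nio.charset.StandardCharsets;

// Illustration only; compareUnsigned is a stand-in for a binary term comparison.
class TermOrderDemo {
    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Compare two strings by the binary order of their UTF-8 bytes.
    static int compareUtf8(String x, String y) {
        return compareUnsigned(x.getBytes(StandardCharsets.UTF_8),
                               y.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF21";                                // U+FF21, one UTF-16 code unit
        String supp = new String(Character.toChars(0x1D11E)); // U+1D11E, a surrogate pair

        // String.compareTo works on UTF-16 code units: 0xFF21 > 0xD834, so bmp sorts last.
        System.out.println(Integer.signum(bmp.compareTo(supp)));    // 1
        // In UTF-8 the lead bytes are 0xEF vs 0xF0, so bmp sorts first.
        System.out.println(Integer.signum(compareUtf8(bmp, supp))); // -1
    }
}
```

The two orders disagree only when a supplementary character is compared with a BMP character above the surrogate range, which is why the difference is slight.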
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* david.w.smiley@gmail.com [mailto:david.w.smiley@gmail.com]
>
> *Sent:* Sunday, May 11, 2014 1:17 AM
> *To:* dev@lucene.apache.org
> *Cc:* Uwe Schindler; Michael McCandless
> *Subject:* Encoding data in terms; UTF8 concerns?
>
>
>
> I’m working on an encoding of numbers / data into indexed terms.  In the
> past I limited the encoding to ASCII but now I’m doing it at a more
> raw/byte level.  Do I have to be aware of UTF8 / sorting issues when I do
> this?  I noticed the following code in NumericUtils.java, line 186:
>
>     while (nChars > 0) {
>       // Store 7 bits per byte for compatibility
>       // with UTF-8 encoding of terms
>       bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
>       sortableBits >>>= 7;
>     }
>
> It’s the comment more than anything that has my attention. Do I have to
> limit my bytes to only the low 7 bits?  If so, why?  I’ve already written a
> bunch of code that generates the terms without consideration for this, and
> I think a bug I’m looking at could be related to this.
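
For reference, the 7-bits-per-byte loop quoted above can be sketched as a standalone round trip. This is a simplified illustration and not the actual NumericUtils code: it omits the shift/prefix byte that NumericUtils writes, and the class name, method names, and the fixed 10-byte length are assumptions made for the example.

```java
// Simplified sketch, not the NumericUtils implementation.
class SevenBitEncoding {
    // 64 bits at 7 bits per byte need ceil(64 / 7) = 10 bytes.
    static final int NUM_BYTES = 10;

    static byte[] encode(long value) {
        // Flip the sign bit so unsigned bit order matches signed numeric order.
        long sortableBits = value ^ 0x8000000000000000L;
        byte[] out = new byte[NUM_BYTES];
        // Fill from the last byte backwards, 7 bits at a time.
        for (int i = NUM_BYTES - 1; i >= 0; i--) {
            out[i] = (byte) (sortableBits & 0x7f);
            sortableBits >>>= 7;
        }
        return out;
    }

    static long decode(byte[] bytes) {
        long sortableBits = 0L;
        for (byte b : bytes) {
            sortableBits = (sortableBits << 7) | (b & 0x7f);
        }
        return sortableBits ^ 0x8000000000000000L;
    }

    // Lexicographic comparison; every encoded byte is in 0..127,
    // so signed and unsigned byte comparison agree here.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (a[i] != b[i]) {
                return a[i] - b[i];
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(-42L)) == -42L);       // true
        System.out.println(compare(encode(-5L), encode(3L)) < 0); // true
    }
}
```

Because no byte ever has its high bit set, such terms are valid single-byte UTF-8 sequences and sort the same way under any of the term orders discussed in this thread.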
>
>
>
> ~ David
>
> p.s. sorry to be CC’ing some folks directly but the mailing list is having
> problems
>


-- 
Sent from Gmail Mobile
