lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Encoding data in terms; UTF8 concerns?
Date Sun, 11 May 2014 08:30:21 GMT
Hi David,

 

The reason NumericUtils encodes numbers this way is historical: the NumericField encoding was
introduced in Lucene 2.9, when all terms were char[] and were encoded as UTF-8 on the index side.
Because of that, encoding each byte with the full 8 bits would have added a large overhead in index
size: Java chars 128…255 are encoded as 2 bytes in UTF-8, so terms would grow by an additional
byte for each such char. Because of this, NumericField uses only 7 bits per byte.
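
To make the overhead concrete, here is a small sketch (plain JDK, not Lucene code; the class and method names are made up for illustration) that prints how many bytes UTF-8 needs for char values just below and above the 7-bit limit:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Overhead {
    // Hypothetical helper: number of bytes UTF-8 produces for one char value.
    static int utf8Length(int charValue) {
        return String.valueOf((char) charValue)
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x7f)); // highest 7-bit value -> 1
        System.out.println(utf8Length(0x80)); // lowest value needing bit 8 -> 2
        System.out.println(utf8Length(0xff)); // highest 8-bit value -> 2
    }
}
```

Any char that uses the eighth bit doubles in size, which is the cost the 7-bit scheme avoided.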

Because we cannot easily change the numeric encoding (we won’t be able to change it ever,
unless we have information about the terms in Field metadata on the index side), this encoding
has survived until now; it is all about index backwards compatibility.

 

If you introduce a new field for spatial, you don’t need to worry about this. Since
Lucene 4, all terms are byte[] and are sorted in binary order. The order of terms in the index
is given by BytesRef.compareTo(), which is purely binary. The good thing for us: UTF-8 order
for string terms (which is what Lucene uses) is identical to byte[] order, but it differs
from UTF-16 order (this is why we need a crazy backwards layer to read 3.x indexes: terms are
sorted slightly differently). We already do full 8-bit encoding for Collation fields (see CollationKeyAttributeFactory,
which encodes terms with their collation key instead of UTF-8).
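
The difference between the two orders can be seen with plain JDK code. This is only a sketch: compareBytes below is a hand-written stand-in for an unsigned binary comparison, not Lucene's BytesRef implementation, and the class name is made up. It compares a BMP character (U+FB01) with a supplementary character (U+1F600) both ways:

```java
import java.nio.charset.StandardCharsets;

public class TermOrderDemo {
    // Stand-in for a pure binary comparison: bytes are compared as unsigned values.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Compares two strings by the binary order of their UTF-8 bytes.
    static int utf8Order(String a, String b) {
        return compareBytes(a.getBytes(StandardCharsets.UTF_8),
                            b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFB01";                                   // one UTF-16 code unit
        String supplementary = new String(Character.toChars(0x1F600)); // surrogate pair

        // UTF-16 order: high surrogate 0xD83D < 0xFB01, so bmp sorts AFTER.
        System.out.println(bmp.compareTo(supplementary) > 0);    // true
        // UTF-8 byte order: lead byte 0xEF < 0xF0, so bmp sorts BEFORE.
        System.out.println(utf8Order(bmp, supplementary) < 0);   // true
    }
}
```

For raw byte[] terms that are not UTF-8 at all, only the binary comparison applies, so any byte value from 0x00 to 0xFF is fine.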

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: david.w.smiley@gmail.com [mailto:david.w.smiley@gmail.com] 
Sent: Sunday, May 11, 2014 1:17 AM
To: dev@lucene.apache.org
Cc: Uwe Schindler; Michael McCandless
Subject: Encoding data in terms; UTF8 concerns?

 

I’m working on an encoding of numbers / data into indexed terms.  In the past I limited
the encoding to ASCII but now I’m doing it at a more raw/byte level.  Do I have to be aware
of UTF8 / sorting issues when I do this?  I noticed the following code in NumericUtils.java,
line 186:

    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }

It’s the comment more than anything that has my attention. Do I have to limit my bytes to
only the low 7 bits?  If so, why?  I’ve already written a bunch of code that generates the
terms without consideration for this, and I think a bug I’m looking at could be related
to this.
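
For reference, the quoted loop can be exercised in isolation. The following is a simplified sketch, not the NumericUtils method: encode7Bit and the class name are hypothetical, the buffer length of 10 is simply ceil(64 / 7), and everything else the real method does is left out. It shows that the 7-bit packing never sets the high bit of any byte:

```java
public class SevenBitSketch {
    // Hypothetical helper: packs a long into 10 bytes, 7 bits per byte,
    // most significant group first.
    static byte[] encode7Bit(long value) {
        // Flip the sign bit so that negative values sort before positive ones.
        long sortableBits = value ^ 0x8000000000000000L;
        byte[] out = new byte[10]; // ceil(64 / 7) = 10
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) (sortableBits & 0x7f); // keep the low 7 bits
            sortableBits >>>= 7;
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : encode7Bit(-1L)) {
            System.out.print((b & 0x80) == 0 ? "ok " : "high-bit ");
        }
        System.out.println();
    }
}
```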

 

~ David

p.s. sorry to be CC’ing some folks directly but the mailing list is having problems

