lucene-dev mailing list archives

From Michael McCandless
Subject Re: Multi-node stats within individual nodes (was "Baby steps...")
Date Tue, 09 Mar 2010 20:13:20 GMT
On Tue, Mar 9, 2010 at 2:11 PM, Marvin Humphrey wrote:
>> > I don't know that compressing the raw materials is going to work as well as
>> > compressing the final product.  Early quantization errors get compounded when
>> > used in later calculations.
>> I would not compress for starters...
> How about lossless compression, then?  Do you need random access into this
> specialized posting list?  For the use cases you've described so far I don't
> think so, since you're just iterating it top to bottom on segment open.

Don't need random access -- just a full scan (or 2, if avg needs to be
regen'd) on startup.

> You could store the total length of the field in tokens and the number of
> unique terms as integers, compressing with vbyte, PFOR or whatever... then
> divide at search time to get average term frequency.  That way, you also avoid
> committing to a float encoding, which I don't think Lucene has standardized
> yet.

Yeah I think that's a great starting approach...
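
A minimal sketch of the approach quoted above, assuming one pair of integers per field: total field length in tokens and number of unique terms, each written with a variable-byte encoding and divided only at read time. The class and method names here are hypothetical and are not part of Lucene's API; the byte layout (7 payload bits per byte, high bit as a continuation flag) is the usual vbyte scheme, chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Lucene code.
final class FieldStatsSketch {

    // Write an int as vbyte: 7 payload bits per byte, low-order bits first,
    // with the high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read back one vbyte-encoded int from the current buffer position.
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Store the two raw integer stats losslessly; no float is written.
    static byte[] encode(int totalTokens, int uniqueTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, totalTokens);
        writeVInt(out, uniqueTerms);
        return out.toByteArray();
    }

    // Sequential scan of the encoded pair, dividing at read time to get
    // the average term frequency. Returns 0 for an empty field.
    static float avgTermFreq(byte[] encoded) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        int totalTokens = readVInt(in);
        int uniqueTerms = readVInt(in);
        return uniqueTerms == 0 ? 0f : (float) totalTokens / uniqueTerms;
    }
}
```

For a field with 1000 tokens and 250 unique terms, `encode` produces four bytes and `avgTermFreq` returns 4.0. Because only integers are stored, no float encoding has to be fixed in the file format, and the data is read strictly front to back, which matches the full-scan-on-startup access pattern.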

