lucene-dev mailing list archives

From Doug Cutting
Subject Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
Date Thu, 17 Nov 2005 22:03:14 GMT
Yonik Seeley wrote:
> I'm not sure I understand why this is.  epsilon is based on 1,
> (smallest number such that 1-epsilon != 1, right?).  What's special
> about 1?

1 is special for multiplication but, you're right, not so special for 
addition, the operation in question.  What makes addition accurate is 
more mantissa bits.  Epsilon is determined by the number of mantissa 
bits: each additional bit halves it.  So a smaller epsilon gives us more 
accuracy but, you're right, no particular epsilon value guarantees it.

> I'm worried about the impact of things like this:
>  smallfloat(10) + smallfloat(1) + smallfloat(1) + smallfloat(1) -> 10
> And it makes things very order dependent:
>  smallfloat(1) + smallfloat(1) + smallfloat(1) + smallfloat(10) -> 12

10 and 12 are pretty close scores, so while this is clearly not a good 
thing, relevant and irrelevant documents are hopefully separated by more 
than this.  In any case, it would be a whole lot more accurate than 
ignoring tfs altogether.  And we can do better in this particular case, 
using 4 or 5 bit mantissas.
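
The order dependence in the quoted example can be reproduced with a 
short sketch.  It rests on several assumptions that are not stated in 
this message: that a small float is one byte formed by truncating the 
upper bits of an IEEE-754 float, that the running sum is re-encoded 
after every addition, and that the 4- and 5-bit cases use the zero 
exponent points 5 and 2 proposed later in this message.  The class and 
method names are hypothetical.

```java
class SmallFloatOrderDemo {
    // Hypothetical encoder: keep only the upper bits of the float's bit
    // pattern (truncation, not round-to-nearest), offset by the zero exponent.
    static int encode(float f, int mantissaBits, int zeroExp) {
        int fzero = (63 - zeroExp) << mantissaBits;
        int small = Float.floatToRawIntBits(f) >> (24 - mantissaBits);
        if (small <= fzero) return f <= 0 ? 0 : 1;  // underflow: zero or smallest code
        if (small >= fzero + 0x100) return 0xFF;     // overflow: clamp to largest code
        return small - fzero;
    }

    // Hypothetical decoder: inverse of encode for every non-zero byte code.
    static float decode(int b, int mantissaBits, int zeroExp) {
        if (b == 0) return 0f;
        int bits = (b & 0xFF) << (24 - mantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Snap a float to the nearest representable small float at or below it.
    static float quantize(float f, int mantissaBits, int zeroExp) {
        return decode(encode(f, mantissaBits, zeroExp), mantissaBits, zeroExp);
    }

    // Add the terms left to right, quantizing each term and each partial sum.
    static float sum(int mantissaBits, int zeroExp, float... terms) {
        float acc = 0f;
        for (float t : terms) {
            acc = quantize(acc + quantize(t, mantissaBits, zeroExp), mantissaBits, zeroExp);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sum(3, 15, 10f, 1f, 1f, 1f)); // 10.0
        System.out.println(sum(3, 15, 1f, 1f, 1f, 10f)); // 12.0
        System.out.println(sum(4, 5, 10f, 1f, 1f, 1f));  // 13.0
        System.out.println(sum(5, 2, 10f, 1f, 1f, 1f));  // 13.0
    }
}
```

Under these assumptions the 3-bit layout loses each +1 once the sum 
reaches 10, because the representable values between 8 and 16 are 2 
apart, while the 4- and 5-bit layouts give 13 in either order.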

> Also, epsilon related to the mantissa, not the exponent?
> That would make it 1/8, not 1/32.

I'm not sure what you're saying.  The current epsilon, with 3-bit 
mantissa, is 1/8, right?  With a five bit mantissa it would go to 1/32, no?

> Also, if we don't need to represent very small numbers, we could lower
> the zero point of the exponent (currently it's 15 for the 5/3 split),
> right?

Right.  Arguably we don't need numbers smaller than 1/100.  A 4-bit 
mantissa with a zero exponent point of 5 gives a minimum value of .0005 
and a max of 2M, plenty of range.  A 5-bit mantissa with zero-exponent 
point of 2 gives us a minimum of .03 and a max of around 2k, nearly the 
desired range, but with greater precision.  In your case above, 10+1+1 
would give 12; moreover, 10+.5+.5 would give 11.  I think the 5-bit 
mantissa is probably the best choice.  What do you think?
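
The range figures above can be checked with a small decoder sketch.  It 
assumes a one-byte layout in which the byte is shifted into the upper 
bits of an IEEE-754 float pattern and an offset derived from the zero 
exponent point is added; that layout, and the names used here, are 
assumptions for illustration rather than something spelled out in this 
message.

```java
import java.util.ArrayList;
import java.util.List;

class SmallFloatRangeDemo {
    // Hypothetical decoder: place the byte in the upper bits of a float bit
    // pattern, then add an exponent offset derived from zeroExp.
    static float decode(int b, int mantissaBits, int zeroExp) {
        if (b == 0) return 0f;
        int bits = (b & 0xFF) << (24 - mantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Smallest positive and largest representable values (codes 1 and 255).
    static float min(int mantissaBits, int zeroExp) { return decode(1, mantissaBits, zeroExp); }
    static float max(int mantissaBits, int zeroExp) { return decode(255, mantissaBits, zeroExp); }

    // All representable values in [lo, hi], in increasing order.
    static List<Float> valuesBetween(int mantissaBits, int zeroExp, float lo, float hi) {
        List<Float> out = new ArrayList<>();
        for (int b = 1; b <= 255; b++) {
            float v = decode(b, mantissaBits, zeroExp);
            if (v >= lo && v <= hi) out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] configs = { {3, 15}, {4, 5}, {5, 2} }; // {mantissa bits, zero exponent}
        for (int[] c : configs) {
            System.out.println(c[0] + "-bit mantissa, zero exponent " + c[1]
                + ": min=" + min(c[0], c[1])
                + " max=" + max(c[0], c[1])
                + " values in [10,12]=" + valuesBetween(c[0], c[1], 10f, 12f));
        }
    }
}
```

With this layout the 4-bit/5 case spans roughly .0005 to 2M and the 
5-bit/2 case roughly .03 to 2k.  Only the 5-bit case has 10.5 and 11 
among its values between 10 and 12, which is why 10+.5+.5 can reach 11 
there.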

