lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bob Carpenter <>
Subject Refactored FuzzyTermEnum
Date Tue, 13 Jun 2006 19:14:20 GMT

I refactored the
edit distance implementation.  It now only uses a single
pair of arrays, and those never get resized.  That required
turning the order of text/target around in the loops.  You'll
see that with the pair of arrays method, they get re-used
hand-over-hand, and are assigned to local variables in the
tight loops.

I removed the calculation of min distance and replaced
it with a boolean -- the min wasn't needed, only the test vs.
the max.  I also flipped some variables around so there's
one less addition in the very inner loop and the arrays are
now looping in the ordinary way (starting at 0 with a < comparison).
I also commented out the redundant definition of the public close()
[which just called super.close() and had none of its own doc.]
I also just compute the max distance each time rather than
fiddling with an array -- it's just a little arithmetic done once
per term, but that could be put back.

I also rewrote min(int,int,int) to get rid of intermediate
assignments.  Is there a lib for this kind of thing?

An intermediate refactoring that does the hand-over-hand
with the existing array and resizing strategy is in

I ran the unit tests as follows on both versions (my hat's off to the
build.xml author(s) -- this all just worked out of the box and was
really easy to follow the first through):

C:\java\lucene-2.0.0>ant -Djunit.includes="" -Dtestcase=TestFuzzyQuery test
Buildfile: build.xml
    [javac] Compiling 1 source file to 
    [junit] Testsuite:
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec
Total time: 2 seconds

Does anyone have regression/performance test harnesses?
The unit tests were pretty minimal (which is a good thing!).
It'd be nice to test the min impl (ternary vs. if/then)
and the assumption about not allocating an
array of max distances.  All told, the refactored version
should be a modest speed improvement, mainly from
unfolding the arrays so they're local one-dimensional refs.

I don't know what the protocol is for one-off contributions.
I'm happy with the Apache license, so that shouldn't
be a problem.  I also don't know whether you use tabs
or spaces -- I untabified the final version and used your
two-space format in emacs.

- Bob Carpenter

View raw message