lucene-dev mailing list archives

From Timo Nentwig <>
Subject Re: Caching FuzzyQuery
Date Sat, 15 Dec 2007 19:23:40 GMT
On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> : BooleanQueries.
> Applications are certainly welcome to do this (there is nothing to stop
> you from calling rewrite before passing the query to your Searcher; I
> believe the overhead of calling rewrite on a query that's already been
> rewritten is fairly low) but I don't think it would be a good idea to add

Why should subsequent rewrites be faster? The query is rewritten from scratch
every time. Even *if* you profit from buffered IO you still have plenty of
string Levenshtein operations to compute.
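
To illustrate the per-term cost meant here, below is a minimal sketch of a
textbook Levenshtein distance. It is not Lucene's actual implementation; the
class name EditDistance is hypothetical. The assumption is that a fuzzy rewrite
has to run a comparison of this kind against many candidate terms from the
index, which is why repeating the rewrite stays expensive even when IO is
buffered.

```java
// Hypothetical helper, not part of Lucene: a textbook two-row
// dynamic-programming Levenshtein distance between two strings.
final class EditDistance {
    private EditDistance() {}

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Each call costs time proportional to the product of the two string lengths, so
the total grows with the number of terms that have to be compared.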

I'm against caching in general because you always run into some problem that
is hard to understand and examine, but this seems to be one of the rare cases
where caching makes sense.
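
A rough sketch of the decorator proposed at the top of this thread (a
WeakHashMap keyed by reader, each value an LRU of rewritten queries) might look
like the following. To keep it self-contained it is generic instead of using
Lucene types: R stands in for IndexReader, Q for Query, and the rewriter
function stands in for a call such as query.rewrite(reader). The class name
RewriteCache and its parameters are hypothetical, and the sketch assumes that
Q has usable equals()/hashCode() and that rewritten queries do not strongly
reference the reader (otherwise the weak key would never be collected).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.BiFunction;

// Hypothetical caching decorator: R stands in for IndexReader, Q for Query.
final class RewriteCache<R, Q> {
    private final int maxEntriesPerReader;
    private final BiFunction<Q, R, Q> rewriter;
    // Weak keys: a reader's cache disappears once the reader is unreachable.
    private final Map<R, Map<Q, Q>> perReader = new WeakHashMap<>();

    RewriteCache(int maxEntriesPerReader, BiFunction<Q, R, Q> rewriter) {
        this.maxEntriesPerReader = maxEntriesPerReader;
        this.rewriter = rewriter;
    }

    // Returns the cached rewrite for (query, reader), computing it on a miss.
    synchronized Q rewrite(Q query, R reader) {
        Map<Q, Q> lru = perReader.computeIfAbsent(reader, r -> newLru());
        Q rewritten = lru.get(query);
        if (rewritten == null) {
            rewritten = rewriter.apply(query, reader);
            lru.put(query, rewritten);
        }
        return rewritten;
    }

    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    private Map<Q, Q> newLru() {
        return new LinkedHashMap<Q, Q>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Q, Q> eldest) {
                return size() > maxEntriesPerReader;
            }
        };
    }
}
```

Because the caller has to wrap the rewrite explicitly, the cache is neither
hidden nor unbounded: its size per reader is fixed by the constructor argument.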

I attached a small test app. The index contains 2.2 million docs and 5 million
terms. I searched for a pretty common term, which was rewritten to 15 terms and
hit roughly 4,000 docs (I also tried a term that was rewritten to 1 term and
hit about 300 docs; it made no difference):

rewritten in 809
Overall search time: 842
rewritten in 271
Overall search time: 274
rewritten in 216
Overall search time: 219
rewritten in 180
Overall search time: 182
rewritten in 184
Overall search time: 186
rewritten in 220
Overall search time: 226
rewritten in 207
Overall search time: 208
rewritten in 181
Overall search time: 183
rewritten in 183
Overall search time: 185
rewritten in 180
Overall search time: 181

$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0    757    298     56    384    0    0    21    36   39    9  5  1 94  0

> something like this to the core ...for starters we are trying to move
> away from "hidden" caches like this that are not transparent (and

Well, at least the existence of such a decorator (which you explicitly have to
use) would give you a hint that this is a performance hot spot. It took me quite
some time to figure it out...

> controllable) by the users because they have the potential to eat up a
> lot of RAM.  But also: the amount of time needed to rewrite the query is
> probably not vastly more expensive than the amount of time to execute the
> search ... you might as well cache the entire result keyed off of the
> original query (and not just the rewritten query object).
> -Hoss
