lucene-java-user mailing list archives

From Jose Luna <>
Subject Re: Advice regarding fuzzy phrase searching
Date Wed, 12 Dec 2007 16:15:25 GMT
Mark, Russ, thanks for the replies.

Mark, this looks great, I think it's exactly what I was looking for.  I 
think this should definitely be added to Lucene when it is stable 
enough.  I suspect there are others that would find it useful.


Mark Miller wrote:
> Take a look at:
> This is an extension to the Highlighter that highlights span and 
> proximity queries. If you rewrite the query it will also do fuzzy 
> queries. I am sure you can easily steal some of the code to do what 
> you want.
> Keep in mind that, because of how Lucene's SpanQuery works, if you ask it to
> find 'mark within 4 of ball', Lucene will not find all occurrences.
> For example, given 'mark close to ball ball', even if you ask for mark within 20
> of ball, a span query will only find the first occurrence of ball, even
> though both occurrences are within 20. If ball were on both sides of
> mark, both would match, but after finding the first ball within 20 of
> mark, the span query doesn't continue looking for another.
> - Mark
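
To make the caveat above concrete, here is a minimal sketch in plain Java (not Lucene's API, and not how SpanQuery is implemented) that enumerates every pair of positions within a given distance, which is the "all occurrences" result the original question asks for. The class and method names (`ProximityPairs`, `positionsOf`, `pairsWithin`) are hypothetical, and whitespace tokenization is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class ProximityPairs {
    /** Token positions (0-based) at which the given term occurs. */
    static int[] positionsOf(String[] tokens, String term) {
        return IntStream.range(0, tokens.length)
                .filter(i -> tokens[i].equals(term))
                .toArray();
    }

    /** Every {a, b} position pair with |a - b| <= maxDistance. */
    static List<int[]> pairsWithin(int[] aPositions, int[] bPositions, int maxDistance) {
        List<int[]> pairs = new ArrayList<>();
        for (int a : aPositions) {
            for (int b : bPositions) {
                if (Math.abs(a - b) <= maxDistance) {
                    pairs.add(new int[] {a, b});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The example sentence from the message above, split on spaces.
        String[] tokens = "mark close to ball ball".split(" ");
        int[] mark = positionsOf(tokens, "mark");
        int[] ball = positionsOf(tokens, "ball");
        for (int[] pair : pairsWithin(mark, ball, 20)) {
            System.out.println("mark@" + pair[0] + " ball@" + pair[1]);
        }
    }
}
```

The nested loop is quadratic in the number of occurrences, which is acceptable for a sketch; with sorted position lists a two-pointer sweep would avoid comparing pairs that are obviously too far apart.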
> Jose Luna wrote:
>> Hello,
>> I am looking for some advice regarding which tools I might use to 
>> solve my problem.  I apologize ahead of time for the long explanation.
>> Problem Description:  I would like to index a set of very large HTML 
>> documents.  I would then be able to run two different kinds of 
>> queries: proximity queries, and fuzzy phrase queries.   I would like 
>> to get the exact positions of the matching results from the query (I 
>> need to modify the original documents at these positions.)  I will 
>> only need to search one document at a time, i.e., I already know 
>> which document I'll be looking in, so what's important is finding the 
>> positions of the hits within that document.
>> For example,  for a fuzzy search, I may want to search for "arterial 
>> oxygen saturation".   I would want this to match "arterial oxygen 
>> saturate", and I would want to get the position of where it matches.  
>> I would also like to do proximity searches, with these broken into 
>> separate terms.  So, I may be searching for "arterial", "oxygen", and 
>> "saturate" all within 10 terms of each other, and get the positions 
>> of the cases that match.
>> To the best of my understanding, Lucene is not a good choice to solve 
>> this problem (please correct me if I'm wrong).   As far as I can 
>> tell, Lucene breaks up a document into a set of terms, and indexes 
>> these in some sort of structure.  My guess is a B+ tree, but I'm 
>> curious to learn more about it -- I couldn't find much in the 
>> documentation about the underlying index structure.   Anyway, this 
>> means that the key->pointer pairs in the index are basically 
>> term->documentID pairs.  So this isn't very suitable for my problem. I 
>> already know which document I want to search, I'm interested in the 
>> position of hits.    If I were to search for the phrase "arterial 
>> oxygen saturation", this would be broken into terms and I could 
>> iterate through all of the TermPositions for a given term in the 
>> document, and try to find out where these terms are adjacent in the 
>> document.  Considering that my document is very large, the phrases 
>> can be 10+ terms, and I need to do this hundreds of times, this 
>> doesn't sound like a very good solution.  If we introduce the idea of 
>> fuzzy matches and proximity searches, it seems like this task of 
>> iterating through TermPositions becomes very complicated.
>> I've spent time reading the docs, creating a test program, and 
>> reading the mailing list.  As far as I can tell, Lucene is geared 
>> towards document based queries, and isn't the ideal tool for my 
>> problem.  I think an index based on a suffix tree (or a variation of one) 
>> would better meet my needs, but I'm not sure how well these perform 
>> with fuzzy and proximity searches.  I've looked around, and I can't 
>> seem to find a good open-source indexing framework like Lucene that's 
>> based on a suffix tree.  Are there any suggestions for tools that 
>> would help with this problem?  Does anyone have any suggestions on 
>> how I might bend Lucene to meet my needs?
>> Thanks in advance,
>> JLuna
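
For the single-document case described in the question, a fuzzy phrase search that reports positions can be sketched without any index at all. The following is a minimal illustration in plain Java, not Lucene code: it slides a window over the document's tokens and accepts a window when every token is within a Levenshtein edit distance of the corresponding phrase term. The names `FuzzyPhraseMatcher`, `editDistance`, `findFuzzyPhrase` and the `maxEdits` threshold are assumptions made for this example, and it returns token positions; mapping those back to character offsets in the HTML would require the tokenizer to record offsets.

```java
import java.util.ArrayList;
import java.util.List;

class FuzzyPhraseMatcher {
    /** Levenshtein distance using two rolling rows of the DP table. */
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Start positions where each window token is within maxEdits of its phrase term. */
    static List<Integer> findFuzzyPhrase(String[] docTokens, String[] phrase, int maxEdits) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.length == 0) {
            return starts; // nothing to match
        }
        for (int start = 0; start + phrase.length <= docTokens.length; start++) {
            boolean match = true;
            for (int k = 0; k < phrase.length && match; k++) {
                match = editDistance(docTokens[start + k], phrase[k]) <= maxEdits;
            }
            if (match) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] doc = "the arterial oxygen saturate fell".split(" ");
        String[] phrase = "arterial oxygen saturation".split(" ");
        // "saturate" and "saturation" differ by 3 edits, so maxEdits = 3 is needed here.
        System.out.println(findFuzzyPhrase(doc, phrase, 3));
    }
}
```

This scan costs roughly (document tokens) x (phrase length) edit-distance computations per query, so for hundreds of queries over a very large document it serves only as a baseline to compare against an index-based approach.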