lucene-java-user mailing list archives

From Sean O'Connor
Subject Re: SpanQuery parser? Update (ugly hack inside...)
Date Mon, 07 Nov 2005 22:43:24 GMT
Erik Hatcher wrote:

> On 4 Nov 2005, at 18:32, Sean O'Connor wrote:
>> I'm posting this primarily hoping to give back a tiny bit to a very
>> helpful community. More likely however, someone else will open my
>> eyes to an easier approach than what I outline below...
>> I've come up with a very ugly conversion approach from regular Query
>> objects into SpanQuery objects. I then use the converted SpanQuery
>> to get span positions (currently both token #, and start/end
>> position). In effect, I have highlighting for simple queries with a
>> very inefficient approach (yea for me!).
> As you and I have talked about on a couple of face to face occasions,  
> this is the approach I am taking on a current consulting project.  My  
> conversion code is slightly different than yours in that I don't  
> rewrite the query, but translate it as-is into comparable SpanQuery  
> subclasses - and this is because I have a RegexQuery and  
> SpanRegexQuery that are comparable.  But rewriting is a good  
> pragmatic way to go for general query types that don't have a  
> comparable SpanQuery subclass.
>> The goal(s) I am trying to accomplish is rather specific I think, so
>> I imagine the use of my hacking is rather limited (i.e. just to me).
>> At the moment my code:
>>    * parses the search text (i.e. user entered query)
> Are you using QueryParser?  If so, you'll also want to account for  
> BooleanQuery, recursively.

I am using QueryParser. So far I have taken the easy route and only
deal with 'or' BooleanQueries. Handling the remaining BooleanQuery
clause types (required and prohibited) should not be much of a stretch.

>>    * rewrites the resulting query to expand wildcards and such against
>>      index
>>    * calls a recursive conversion function with very basic conversion
>>      understanding
>>          o TermQuery -> SpanTerm
>>          o PhraseQuery -> SpanNear
>>          o others in progress as time permits
>> Currently, I only process simple query strings like:
>> "blue green yellow" => SpanOrQuery
>> "luce* acti*" => SpanOrQuery with wild cards expanded
>>    e.g.: lucene lucent action acting ... all or'ed together in a  
>> braindead fashion
>> "luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and  
>> SpanNear (no slop)
>>    er, hopefully you get the picture, I'm not up to showing a vector
>> of this one... :-)
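
The conversion steps quoted above can be sketched roughly as follows. This is
only an illustration of the recursive shape, not the actual code from either
project: every class in it is a hypothetical stand-in for the Lucene class of
a similar name, so that it compiles without Lucene on the classpath. It
assumes the query has already been rewritten (wildcards expanded into plain
terms), that a phrase maps to an in-order SpanNear with no slop, and that only
the all-optional ('or') case of BooleanQuery is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Lucene's Query/SpanQuery classes (assumption:
// simplified models, not the real API).
class SpanConversionSketch {

    interface Query {}

    static final class TermQuery implements Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }

    static final class PhraseQuery implements Query {
        final List<String> terms;
        PhraseQuery(String... terms) { this.terms = Arrays.asList(terms); }
    }

    // Models only a BooleanQuery whose clauses are all optional ("or").
    static final class OrQuery implements Query {
        final List<Query> clauses;
        OrQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    interface SpanQuery {}

    static final class SpanTerm implements SpanQuery {
        final String term;
        SpanTerm(String term) { this.term = term; }
        @Override public String toString() { return "spanTerm(" + term + ")"; }
    }

    static final class SpanNear implements SpanQuery {
        final List<SpanQuery> clauses;
        final int slop;
        final boolean inOrder;
        SpanNear(List<SpanQuery> clauses, int slop, boolean inOrder) {
            this.clauses = clauses;
            this.slop = slop;
            this.inOrder = inOrder;
        }
        @Override public String toString() {
            return "spanNear(" + clauses + ", slop=" + slop + ")";
        }
    }

    static final class SpanOr implements SpanQuery {
        final List<SpanQuery> clauses;
        SpanOr(List<SpanQuery> clauses) { this.clauses = clauses; }
        @Override public String toString() { return "spanOr(" + clauses + ")"; }
    }

    // Recursive conversion: term -> span term, phrase -> span near (no slop),
    // "or" boolean -> span or over the converted clauses.
    static SpanQuery convert(Query q) {
        if (q instanceof TermQuery) {
            return new SpanTerm(((TermQuery) q).term);
        }
        if (q instanceof PhraseQuery) {
            List<SpanQuery> parts = new ArrayList<>();
            for (String t : ((PhraseQuery) q).terms) {
                parts.add(new SpanTerm(t));
            }
            return new SpanNear(parts, 0, true);
        }
        if (q instanceof OrQuery) {
            List<SpanQuery> parts = new ArrayList<>();
            for (Query clause : ((OrQuery) q).clauses) {
                parts.add(convert(clause)); // recurse into nested queries
            }
            return new SpanOr(parts);
        }
        // Query types without a mapping yet ("others in progress").
        throw new IllegalArgumentException(
            "No span equivalent yet for " + q.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        // Roughly what "luce* acti* \"book rocks\"" might look like after
        // rewriting, with only two expanded terms shown.
        Query q = new OrQuery(
            new TermQuery("lucene"),
            new TermQuery("action"),
            new PhraseQuery("book", "rocks"));
        System.out.println(convert(q));
    }
}
```

Throwing on an unmapped query type is a choice made for the sketch; a real
converter might instead skip the clause or fall back to the unconverted query.
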
>> I would be happy to discuss my approach if there is anyone
>> interested. I assume I am pretty much alone in finding this
>> inefficient approach useful. For me, it is the functionality that
>> overrides performance issues.
> What is inefficient about it?   The rewrite stuff is the main  
> difference, and perhaps that is the issue you're encountering.  Where  
> do you see the performance issues?
> Converting a query, for me at least, is fast - perhaps because there  
> is no rewriting involved.

Good question. I haven't done any performance testing, nor am I seeing
any performance problems with Lucene. I just assumed that my approach
was adding an extra (unoptimized) layer. So for now, forget I mentioned
that :-).

>> I have something which can take user search strings and do hit
>> highlighting for the exact hit found. This is really only useful for
>> "termA near 'some phrase'" at the moment, but might become more
>> advanced in the next 2-3 months.
> I'm basically implementing this very thing.  I will likely be  
> enhancing the contrib/highlighter code in the next month to use  
> SpanQuery for highlighting, as well as adding field-aware highlighting.
Excellent. I will keep an eye out for it. Thanks for the heads up.

>     Erik
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

