lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: Test code for regex queries
Date Thu, 24 Nov 2005 09:25:15 GMT

On 24 Nov 2005, at 03:17, Paul Elschot wrote:
>> I must admit that I haven't used the surround parser.  For my custom
>> parser (a legacy syntax that no one here would want), I take any term
>> that has an *, ?, or [...] syntax as a regex term.
> I had another look at the javadocs of java regex package.
> The normal brackets in a regex are not needed for queries, so they
> can be left as they are.

I don't understand what you mean about brackets not being needed.   
Brackets certainly factor in greatly into regular expressions:


What am I missing?

>> There are still some TODO's with the (Span)RegexQuery - such as being
>> wise about the prefix length.  Right now it is not wise enough.  I've
>> spent some time looking for a regex parser that could parse a regex
>> expression into an AST so that it could be used for determining the
>> last static character to start term enumeration.  This would also
>> come in very handy in being able to rotate a regular expression
>> string to maximize the static prefix when indexing with an analyzer
>> that rotates terms.  If anyone has suggestions/pointers to how this
>> could be accomplished, it'd be most appreciated!
> I think I'll simply treat each term as a potential regex and
> use alphanumeric characters for the prefix. I'll try and leave
> parsing of the regex to the java regex package as much as possible.

Simply using alphanumeric to determine the prefix doesn't work in all  
cases though.  Here's an example I just added, commented out, to  

  assertEquals(1, regexQueryNrHits("r?over"));

The 'r' is optional, yet it is chosen as a prefix and thus no  
documents match even though "over" does match that regex.  The quick  
fix for this is, like FuzzyQuery, to enumerate all terms blindly.   
Another option is to allow the user of (Span)RegexQuery to provide  
the prefix, at least pushing the burden back a bit.

I still think a regex parser would be mighty useful for this issue,  
as well as the next...

> Rotating from the suffix should also be straightforward for
> alphanumerical chars.

It's not as trivial as this, if you want the rotated prefixes to be  
maximized.  Take, for example, the regex expression


Using rotation to maximize the prefix and speeding up term  
enumeration a calculation of of the maximum non-regex piece is  
needed, including a calculation on whether the head and tail combined  
make a larger prefix.  For example, using '$' to denote the end of  
the string, the rotated version of this should be:


With a regex parse tree, it should be possible to be wise about what  
is a static prefix and to compute the size of all the static pieces  
allowing for clever rotation to make regex queries as efficient as  
possible.  Now where is that regex parser grammar?!  :/

> I'd like the surround parser to be a power tool that provides
> everything that Lucene has under the hood. Regexes fit well,
> because they are already used for truncation, only the
> truncation syntax will have a dot added as far as I can see now.

I'm still missing why brackets don't factor into the equation.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message