lucene-dev mailing list archives

From Paul Elschot
Subject Surround query parser
Date Sun, 18 Apr 2004 12:51:45 GMT
Dear developers,

I'd like to contribute a query parser named Surround.

The implementation uses mostly Lucene's BooleanQuery, TermQuery,
SpanNearQuery, SpanOrQuery and SpanTermQuery. These are chosen
depending on the query operator.

Currently the sources are in a CVS working copy next to a lucene
working copy. There is some test code which uses the latest
lucene jar generated from the lucene working copy.

The source code has cooled down far enough for a
package restructuring. In case there is interest, how would
the sources best be structured? Currently two packages are
used for the sources: org.surround.queryparser and
Following the name of org.apache.lucene.wordnet in the sandbox,
would org.apache.lucene.surround be OK?



Surround consists of these operators (uppercase/lowercase):

AND/OR/NOT/nW/nN    as infix and
AND/OR/nW/nN        as prefix.

Distance operators W and N have default n=1, max 99.
Implemented as ordered/unordered SpanQuery with slop = (n - 1).
An example prefix form is:

20N(aa*, bb*, cc*)
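
To make the slop rule concrete, here is a minimal sketch of how a
distance operator token could be turned into the settings for a span
query. DistanceOp and its fields are hypothetical names used only for
illustration; they are not taken from the Surround sources.

```java
// Sketch only: DistanceOp is a hypothetical helper, not a Surround or Lucene class.
final class DistanceOp {
    final int distance;     // n in nW / nN, between 1 and 99
    final boolean ordered;  // true for W (ordered), false for N (unordered)
    final int slop;         // slop for the span query: n - 1

    private DistanceOp(int distance, boolean ordered) {
        this.distance = distance;
        this.ordered = ordered;
        this.slop = distance - 1;
    }

    /** Parses tokens such as "W", "3w", "20N" or "99n" (case-insensitive). */
    static DistanceOp parse(String token) {
        if (token == null || token.isEmpty()) {
            throw new IllegalArgumentException("empty distance operator");
        }
        char kind = Character.toUpperCase(token.charAt(token.length() - 1));
        if (kind != 'W' && kind != 'N') {
            throw new IllegalArgumentException("not a distance operator: " + token);
        }
        String digits = token.substring(0, token.length() - 1);
        int n = 1; // default distance when no number is given
        if (!digits.isEmpty()) {
            if (!digits.matches("[0-9]{1,2}")) {
                throw new IllegalArgumentException("bad distance in: " + token);
            }
            n = Integer.parseInt(digits);
        }
        if (n < 1 || n > 99) {
            throw new IllegalArgumentException("distance out of range 1..99: " + token);
        }
        return new DistanceOp(n, kind == 'W');
    }
}
```

The slop and ordered values would then be passed as the slop and
inOrder arguments of Lucene's SpanNearQuery constructor.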

The name Surround was chosen because of this prefix form
and because it uses the newly introduced span queries
to implement the proximity operators.

The operators and their prefix and infix
forms were borrowed from the user documentation of
various other query languages on the internet.

Wildcards/truncations are the same as in the
Lucene standard query parser:
* for internal and suffix truncation,
? to match one character.

And there is:
^ for boosting a term or a bracketed subquery.

Some examples (best read with fixed size font):

aa and bb
aa and bb or cc        same effect as:  (aa and bb) or cc
aa NOT bb NOT cc       same effect as:  (aa NOT bb) NOT cc

and(aa,bb,cc)          aa and bb and cc
99w(aa,bb,cc)          ordered span query with slop 98
99n(aa,bb,cc)          unordered span query with slop 98

3w(a?a or bb?, cc*)    W subqueries: OR, truncation

title: text: aa
title : text : aa or bb
title:text: aa not bb
title:aa not text:bb

cc 3w dd               infix: dual.

cc N dd N ee           same effect as:   (cc N dd) N ee

text: aa 3n bb         same effect as:    text: (aa 3n bb)

Development status

Not tested: multiple fields, internally mapped to OR queries.

Suffix truncation is implemented very similarly to Lucene's PrefixQuery.
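
As a rough illustration of the prefix approach (not the Surround code
itself), the sketch below expands a truncated term against a sorted
term list. A java.util.SortedSet stands in for the index's term
dictionary, and PrefixExpansion is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Sketch only: a sorted set of strings stands in for the index's sorted terms.
final class PrefixExpansion {
    /** Returns all terms that start with the given prefix, in sorted order. */
    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<String>();
        // Seek to the first term >= prefix, then scan while the prefix still matches.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Each matching term would become one clause of an OR over the expanded
terms.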

Wildcards (? and internal *) are implemented with regular expressions
to allow further variations. A reimplementation using Lucene's
WildcardTermEnum should be no problem.
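
For readers who want to see the regular expression idea spelled out,
here is a minimal sketch that translates a wildcard term into a
java.util.regex.Pattern. WildcardRegex is a hypothetical name, and the
translation shown is an assumption about one reasonable way to do it,
not the actual Surround code.

```java
import java.util.regex.Pattern;

// Sketch only: WildcardRegex is a hypothetical helper, not a Surround class.
final class WildcardRegex {
    /** Translates * (zero or more characters) and ? (exactly one character). */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                // Flush the literal run seen so far, quoted so that
                // regex metacharacters in the term are matched verbatim.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    /** True when the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }
}
```

A pattern built this way would be tested against each candidate term
from the index, so in practice the literal part before the first
wildcard is useful for limiting which terms are enumerated.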

There is a warning for ordered subqueries with 3 or more subqueries,
due to a pending bug in the ordered SpanNearQuery.

Warnings about missing terms are sent to System.out; this might
be replaced by another stream.

There are no javadoc comments.
I'm using Java 1.4.2, so there are probably some dependencies
on Java 1.4.
Other tools used: Ant 1.6b2 and JavaCC 3.2.
The build target javacc should be used explicitly
when the .jj file is changed.

The sources, apart from a build.xml file:

... src/java/org/surround/search> wc *.java ../q*/*.jj | sort -r

   1424    4322   40776 total
    436    1404   11140 ../queryparser/QueryParser.jj
    138     484    4582
    106     316    3359
    101     266    2860
     96     245    2480
     95     266    2994
     78     245    2390
     72     218    2044
     60     151    1613
     52     132    1378
     49     158    1446
     46     130    1412
     31      80     826
     22      79     866
     16      54     569
     15      59     512
     11      35     305

And the test code:

... /src/test/org/surround/search> wc *.java | sort -r

    550    1963   16899 total
    203     875    6761
    105     444    3582
     97     272    2805
     55     144    1528
     51     121    1072
     39     107    1151
