lucene-java-user mailing list archives

From Magnus Johansson
Subject QueryParser and compound words
Date Tue, 11 Mar 2003 10:05:29 GMT

I have written an Analyzer for Swedish. Compound words are common in
Swedish, so my Analyzer tries to split compound words
into their parts. For example, the Swedish word fotbollsmatch (football
game) is split into fotboll and match.
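
To make the splitting step concrete, here is a minimal sketch in plain Java (no Lucene classes). It is not my actual Analyzer. The class name CompoundSplitSketch, the two-word LEXICON, and the rule of allowing a linking "s" between the parts are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompoundSplitSketch {
    // Hypothetical toy lexicon; a real analyzer would load a full word list.
    private static final Set<String> LEXICON =
        new HashSet<>(Arrays.asList("fotboll", "match"));

    /**
     * Splits a word into two lexicon entries, optionally joined by a
     * linking "s". Returns the word unchanged (as a single element)
     * when no such split is found.
     */
    public static List<String> split(String word) {
        // Try the longest head first.
        for (int i = word.length() - 1; i > 0; i--) {
            String head = word.substring(0, i);
            if (!LEXICON.contains(head)) {
                continue;
            }
            String tail = word.substring(i);
            if (LEXICON.contains(tail)) {
                return Arrays.asList(head, tail);
            }
            // Assumed rule: allow a linking "s", as in fotboll + s + match.
            if (tail.startsWith("s") && LEXICON.contains(tail.substring(1))) {
                return Arrays.asList(head, tail.substring(1));
            }
        }
        return Arrays.asList(word);
    }

    public static void main(String[] args) {
        System.out.println(split("fotbollsmatch")); // [fotboll, match]
        System.out.println(split("fotboll"));       // [fotboll]
    }
}
```

A word that cannot be split this way comes back as a single part, so ordinary words pass through unchanged.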

However, when I use my Analyzer with the QueryParser, the query
fotbollsmatch is changed into "fotboll match" (notice the quotes),
when what I really want is the query fotboll match (with no quotes).
Is this possible? The splitting of compound words is
of no real use if I can't get rid of the quotes.
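
The reason the quotes matter can be sketched in plain Java, again without any Lucene classes. The sketch assumes that a quoted query requires the parts to be adjacent and in order, while the unquoted query (with the QueryParser's default OR operator) matches a document containing either part. The class and method names are hypothetical and only model the matching rule; they are not how Lucene evaluates queries internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseVsTermsSketch {
    /** Quoted (phrase) semantics: all terms adjacent and in order. */
    public static boolean matchesPhrase(List<String> docTokens,
                                        List<String> terms) {
        return Collections.indexOfSubList(docTokens, terms) >= 0;
    }

    /** Unquoted semantics with an OR operator: any single term suffices. */
    public static boolean matchesAnyTerm(List<String> docTokens,
                                         List<String> terms) {
        for (String term : terms) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("fotboll", "match");
        // Tokens of a document containing the compound, after splitting.
        List<String> compoundDoc = Arrays.asList("fotboll", "match");
        // Tokens of a document that uses the two words separately.
        List<String> separateDoc = Arrays.asList("match", "i", "fotboll");

        System.out.println(matchesPhrase(compoundDoc, query));  // true
        System.out.println(matchesPhrase(separateDoc, query));  // false
        System.out.println(matchesAnyTerm(separateDoc, query)); // true
    }
}
```

Under these assumptions the quoted form finds only documents where the parts sit next to each other, which is why the split gains little for recall.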

I have attached some sample code that illustrates the problem
(using a dummy Analyzer that splits words longer than five
characters into two).



import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;


public class TestAnalyzer extends Analyzer {
     public TokenStream tokenStream(String s, Reader reader) {
         return new SplitStream(new StandardTokenizer(reader));
     }

     public static void main(String[] args) throws Exception {
         QueryParser qp = new QueryParser("fieldname",
             new TestAnalyzer());
         Query q = qp.parse("queryparser");
         // The parsed query comes out as a quoted phrase
         System.out.println("Query: " + q.toString("fieldname"));
         System.out.println("Correct: query parser");
     }
}

// Dummy filter: splits every token longer than SPLIT_SIZE characters in two
class SplitStream extends TokenStream {
     private static final int SPLIT_SIZE = 5;
     private TokenStream tstream;
     private String buffer = null;   // second half waiting to be returned
     private int start, end;         // offsets of the buffered second half

     public SplitStream(TokenStream tstream) { this.tstream = tstream; }

     public Token next() throws IOException {
         if (buffer == null) {
             // Nothing buffered: read the next token from the wrapped stream
             Token tok = tstream.next();
             if (tok == null) {
                 return null;
             } else if (tok.termText().length() > SPLIT_SIZE) {
                 // Keep the second half for the next call, return the first
                 buffer = tok.termText().substring(SPLIT_SIZE);
                 start = tok.startOffset() + SPLIT_SIZE;
                 end = tok.endOffset();
                 return new Token(
                     tok.termText().substring(0, SPLIT_SIZE),
                     tok.startOffset(),
                     tok.startOffset() + SPLIT_SIZE);
             } else {
                 return tok;
             }
         } else {
             // Return the buffered second half
             Token t = new Token(buffer, start, end);
             buffer = null;
             return t;
         }
     }
}
