lucene-solr-user mailing list archives

From Lajos <la...@protulae.com>
Subject Help! Issue with tokens in custom synonym filter
Date Mon, 31 Aug 2009 14:32:59 GMT
Hi all,

I've been writing some custom synonym filters and have run into an issue 
with returning a list of tokens. I have a synonym filter that uses the 
WordNet database to extract synonyms. My problem is how to define the 
offsets and position increments in the new Tokens I'm returning.

For an input token, I get a list of synonyms from the WordNet database. 
I then create a List<Token> of those results. Each Token is created with 
the same startOffset, endOffset and positionIncrement of the input 
Token. Is this correct? My understanding from looking at the Lucene 
codebase is that the startOffset/endOffset should be the same, as we are 
referring to the same term in the original text. However, I don't quite 
get the positionIncrement. I understand that it is relative to the 
previous term ... does this mean all my synonyms should have a 
positionIncrement of 0? But whether I use 0 or the positionIncrement of 
the original input Token, Solr seems to ignore the returned tokens ...
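To make the position question concrete, here is a minimal self-contained sketch of how increments turn into positions. It does not use the Lucene classes; `PositionSketch`, `Tok` and `positions` are hypothetical names for illustration, and counting positions from 0 is an assumption. A token with an increment of 0 lands on the same position as the token before it, which is the usual convention for stacking a synonym on the original term.

```java
import java.util.ArrayList;
import java.util.List;

class PositionSketch {
    // Hypothetical stand-in for a token: only the term and its position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // The absolute position is the running sum of increments.
    // Starting at -1 means the first token with increment 1 gets position 0.
    static List<Integer> positions(List<Tok> stream) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.positionIncrement;
            out.add(pos);
        }
        return out;
    }
}
```

With a stream of "quick" (increment 1), "fast" (increment 0), "fox" (increment 1), the synonym "fast" shares position 0 with "quick" and "fox" follows at position 1. If the synonym instead copied the increment of 1, it would occupy its own position and push "fox" one further along.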

This is a summary of what is in my filter:

*************************************************

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
   if (output != null) {
     // Here we are just outputting matched synonyms
     // that we previously created from the input token
     // The input token has already been returned
     if (output.hasNext()) {
       return output.next();
     } else {
       return null;
     }
   }

   synonyms = new ArrayList<Token>();

   Token t = input.next(in);
   if (t == null) return null;

   String value = new String(t.termBuffer(), 0,
     t.termLength()).toLowerCase();

   // Get list of WordNet synonyms (code removed)
   // Iterate through WordNet synonyms
   for (String wordNetSyn : wordNetSyns) {
     Token synonym = new Token(t.startOffset(), t.endOffset(),
       t.type());
     synonym.setPositionIncrement(t.getPositionIncrement());
     synonym.setTermBuffer(wordNetSyn.toCharArray(), 0,
       wordNetSyn.length());
     synonyms.add(synonym);
   }

   output = synonyms.iterator();

   // Return the original word, we want it
   return t;
}
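
One thing the summary above appears to do: once `output` has been set it is never cleared, so after the first token's synonyms run out the method returns null, which normally signals end of stream. Below is a minimal self-contained sketch of the buffering pattern with that case handled. It is not the Lucene API; `SynonymBufferSketch`, `Tok` and the synonym map are hypothetical stand-ins (the map replaces the WordNet lookup), and giving synonyms an increment of 0 is the assumption being illustrated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SynonymBufferSketch {
    // Hypothetical stand-in for a token.
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;
        final int positionIncrement;

        Tok(String term, int startOffset, int endOffset, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }
    }

    private final Iterator<Tok> input;
    private final Map<String, List<String>> synonymsByTerm; // stands in for WordNet
    private final Deque<Tok> pending = new ArrayDeque<>();

    SynonymBufferSketch(Iterator<Tok> input, Map<String, List<String>> synonymsByTerm) {
        this.input = input;
        this.synonymsByTerm = synonymsByTerm;
    }

    // Returns the next token, or null only when the input itself is exhausted.
    Tok next() {
        // Drain synonyms buffered for the previous input token first.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        // Buffer empty: fall through to the next input token instead of ending.
        if (!input.hasNext()) {
            return null;
        }
        Tok t = input.next();
        List<String> syns = synonymsByTerm.get(t.term.toLowerCase(Locale.ROOT));
        if (syns != null) {
            for (String s : syns) {
                // Same offsets as the original term; increment 0 stacks the
                // synonym on the original token's position.
                pending.add(new Tok(s, t.startOffset, t.endOffset, 0));
            }
        }
        // Return the original word first; its synonyms follow on later calls.
        return t;
    }
}
```

For the input "quick fox" with synonyms "fast" and "speedy" registered for "quick", successive calls yield quick (increment 1), fast (0), speedy (0), fox (1), then null.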
