lucene-java-user mailing list archives

From Diego Socaceti <socac...@gmail.com>
Subject Analyzer for supporting hyphenated words
Date Fri, 17 Jul 2015 08:41:24 GMT
Hi all,

I'm new to Lucene and am trying to write my own analyzer to support
hyphenated words like wi-fi, jean-pierre, etc.
For our customer it is important to find the word
- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*
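
To make the requirement concrete, here is a minimal sketch in plain Java (no Lucene) of the terms that would have to end up in the index for one hyphenated word. HyphenTerms and expand are hypothetical names used only for illustration, and splitting on "-" alone is an assumption; this is not how any Lucene filter is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: lists the terms a hyphenated word
// would need to be indexed under to satisfy the requirement above.
class HyphenTerms {

  static List<String> expand(String word) {
    String lower = word.toLowerCase(Locale.ROOT);
    Set<String> terms = new LinkedHashSet<>();   // keeps order, drops duplicates
    terms.add(lower);                            // original form, e.g. "wi-fi"
    StringBuilder joined = new StringBuilder();
    for (String part : lower.split("-")) {       // assumption: split on '-' only
      if (!part.isEmpty()) {
        terms.add(part);                         // each part, e.g. "wi", "fi"
        joined.append(part);
      }
    }
    if (joined.length() > 0) {
      terms.add(joined.toString());              // catenated form, e.g. "wifi"
    }
    return new ArrayList<>(terms);
  }

  public static void main(String[] args) {
    System.out.println(expand("wi-fi"));         // [wi-fi, wi, fi, wifi]
    System.out.println(expand("jean-pierre"));
  }
}
```

The jean-* case would then presumably be covered by a prefix query against the preserved original term, assuming the query side does not split the prefix itself.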




The analyzer:
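// Imports assume the Lucene 5.x package layout (lucene-analyzers-common).
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;

public class SupportHyphenatedWordsAnalyzer extends Analyzer {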

  protected NormalizeCharMap charConvertMap;

  public SupportHyphenatedWordsAnalyzer() {
    initCharConvertMap();
  }

  protected void initCharConvertMap() {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\"", "");
    charConvertMap = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {

    final Tokenizer src = new WhitespaceTokenizer();

    TokenStream tok = new WordDelimiterFilter(src,
        WordDelimiterFilter.PRESERVE_ORIGINAL
            | WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS,
        null);
    tok = new LowerCaseFilter(tok);
    tok = new LengthFilter(tok, 1, 255);
    tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    return new TokenStreamComponents(src, tok);
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new MappingCharFilter(charConvertMap, reader);
  }
}





The analyzer seems to work except for exact phrase match queries.

For example, the following values are indexed:

FD-A320-REC-SIM-1
FD-A320-REC-SIM-10
FD-A320-REC-SIM-11
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1


The (exact) query "FD-A320-REC-SIM-1" returns
FD-A320-REC-SIM-1
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1

For our customer this is wrong, because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1.
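
One plausible reading of this result, sketched in plain Java without Lucene: if both the indexed value and the query are split into parts at the hyphens, a phrase query only checks that the query parts occur next to each other, in order, somewhere in the indexed value. PhraseMatchSketch and its methods are hypothetical names, and the model is deliberately simplified: it ignores the extra tokens and positions that PRESERVE_ORIGINAL and CATENATE_WORDS produce in the real filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Simplified model (an assumption, not Lucene code): a value is reduced to its
// lowercased hyphen-separated parts, and a phrase "matches" when the query
// parts appear as a contiguous run inside the indexed parts.
class PhraseMatchSketch {

  static List<String> parts(String value) {
    return Arrays.asList(value.toLowerCase(Locale.ROOT).split("-"));
  }

  static boolean phraseMatches(String indexed, String query) {
    return Collections.indexOfSubList(parts(indexed), parts(query)) >= 0;
  }

  static List<String> search(List<String> indexedValues, String query) {
    List<String> hits = new ArrayList<>();
    for (String value : indexedValues) {
      if (phraseMatches(value, query)) {
        hits.add(value);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> indexed = Arrays.asList(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");
    // Prints the three values listed above; the -10 and -11 entries are
    // excluded because their last part differs from "1".
    System.out.println(search(indexed, "FD-A320-REC-SIM-1"));
  }
}
```

Under this model the MIA- and SIN- entries match because the query parts sit contiguously after the extra leading part, so nothing in the phrase itself can require that the value starts at "FD".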

Do you have any ideas or tips on how we should change our current
analyzer to support this requirement?


Thanks and Kind regards
Diego
