lucene-java-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Tue, 21 Jul 2015 12:12:22 GMT
Unless you explicitly enable automatic phrase queries, the Lucene query
parser combines the sub-terms with an OR operator whenever a single
whitespace-delimited term analyzes into a sequence of terms.

See:
https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
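
To make the difference concrete, here is a minimal sketch in plain Java. It
is not Lucene code: the analyze() helper is a hypothetical stand-in that only
lowercases and splits on hyphens, and the class and method names are made up
for illustration. It models the two ways a sequence of sub-terms can be
matched, assuming the indexed tokens keep their original order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SubTermMatchSketch {

    // Hypothetical stand-in for analysis: lowercase, then split on hyphens.
    public static List<String> analyze(String term) {
        return Arrays.asList(term.toLowerCase().split("-"));
    }

    // OR semantics: the document matches if it contains any sub-term.
    public static boolean matchesOr(List<String> docTokens, List<String> queryTokens) {
        for (String q : queryTokens) {
            if (docTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Phrase semantics: the sub-terms must appear adjacent and in order.
    public static boolean matchesPhrase(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = analyze("FD-A320-REC-SIM-1");
        String[] docs = {
            "FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10", "MIA-FD-A320-REC-SIM-1"
        };
        for (String doc : docs) {
            List<String> tokens = analyze(doc);
            System.out.println(doc
                + " or=" + matchesOr(tokens, query)
                + " phrase=" + matchesPhrase(tokens, query));
        }
    }
}
```

In this simplified model, OR semantics match all three documents, while
phrase semantics exclude FD-A320-REC-SIM-10 but still match
MIA-FD-A320-REC-SIM-1, because the query's sub-terms occur there as a
contiguous run. With the classic query parser, the switch between the two
behaviours is the setAutoGeneratePhraseQueries(boolean) method linked above.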


-- Jack Krupansky

On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com> wrote:

> Hi all,
>
> I'm new to Lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
> The analyzer:
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>   protected NormalizeCharMap charConvertMap;
>
>   public SupportHyphenatedWordsAnalyzer() {
>     initCharConvertMap();
>   }
>
>   protected void initCharConvertMap() {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\"", "");
>     charConvertMap = builder.build();
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(final String fieldName) {
>
>     final Tokenizer src = new WhitespaceTokenizer();
>
>     TokenStream tok = new WordDelimiterFilter(src,
>         WordDelimiterFilter.PRESERVE_ORIGINAL
>             | WordDelimiterFilter.GENERATE_WORD_PARTS
>             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>             | WordDelimiterFilter.CATENATE_WORDS,
>         null);
>     tok = new LowerCaseFilter(tok);
>     tok = new LengthFilter(tok, 1, 255);
>     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>
>     return new TokenStreamComponents(src, tok);
>   }
>
>   @Override
>   protected Reader initReader(String fieldName, Reader reader) {
>     return new MappingCharFilter(charConvertMap, reader);
>   }
> }
>
> The analyzer seems to work except for exact phrase match queries.
>
> e.g. the following words are indexed
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
>
> The (exact) query "FD-A320-REC-SIM-1" returns
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> For our customer this is wrong, because this exact phrase match
> query should return only the single entry FD-A320-REC-SIM-1.
>
> Do you have any ideas or tips on how to change our current
> analyzer to support this requirement?
>
>
> Thanks and Kind regards
> Diego
>
