lucene-java-user mailing list archives

From Bahaa Eldesouky <bahaab...@gmail.com>
Subject QueryParser with CustomAnalyzer wrongly uses PatternReplaceCharFilter
Date Thu, 28 Apr 2016 09:54:22 GMT
I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene
6.0.0 to parse queries with a CustomAnalyzer, as shown below:

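// Imports needed by the snippet
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public static void testFilmAnalyzer() throws IOException, ParseException {
    // The char filter removes "movie", "film" or "picture" and everything after it.
    CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
            .addCharFilter("patternreplace",
                    "pattern", "(movie|film|picture).*",
                    "replacement", "")
            .withTokenizer("standard")
            .build();

    QueryParser qp = new QueryParser("name", nameAnalyzer);
    qp.setDefaultOperator(QueryParser.Operator.AND);
    String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};

    // For each input, print the analyzer's tokens and then the parsed query.
    for (String str : strs) {
        System.out.println("Analyzing \"" + str + "\":");
        showTokens(str, nameAnalyzer);
        Query q = qp.parse(str);
        System.out.println("Parsed query of \"" + str + "\":");
        System.out.println(q + "\n");
    }
}

// Prints every token the analyzer produces for the given text on one line.
private static void showTokens(String text, Analyzer analyzer) throws IOException {
    StringReader reader = new StringReader(text);
    try (TokenStream stream = analyzer.tokenStream("name", reader)) {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term.toString() + "]");
        }
        stream.end();
    }
    System.out.println();
}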




I get the following output when I invoke testFilmAnalyzer():

Analyzing "avatar film fiction":
[avatar]
Parsed query of "avatar film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film fiction":
[avatar]
Parsed query of "avatar-film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film-fiction":
[avatar]
Parsed query of "avatar-film-fiction":
name:avatar


In the first two cases the analyzer alone emits only [avatar], yet the
parsed query still contains "fiction". It seems like the analyzer applies
the PatternReplaceCharFilter in its intended order (i.e. before
tokenization), while the QueryParser applies it afterwards. Does anyone
have an explanation for that? Isn't that a bug?
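
To make the difference concrete, here is a minimal plain-Java sketch (no Lucene dependency) of one behaviour that would produce the output above. It rests on an assumption I have not verified against the QueryParser source: that the parser splits the query text on whitespace first and runs the analyzer on each chunk separately. The class name and both methods are hypothetical stand-ins, and the tokenizer is approximated by splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CharFilterOrderSketch {
    // Same pattern as the "patternreplace" char filter in the snippet above.
    private static final Pattern FILM = Pattern.compile("(movie|film|picture).*");

    // Stand-in for char filter + tokenizer: strip the pattern from the whole
    // text, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        String filtered = FILM.matcher(text).replaceAll("");
        List<String> terms = new ArrayList<>();
        for (String t : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // ASSUMPTION (hypothetical model of the parser): split on whitespace
    // first, then analyze each chunk on its own.
    static List<String> analyzePerChunk(String text) {
        List<String> terms = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            terms.addAll(analyze(chunk));
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String str : strs) {
            System.out.println(str + " -> whole text: " + analyze(str)
                    + ", per chunk: " + analyzePerChunk(str));
        }
    }
}
```

Under that assumption, the per-chunk variant keeps "fiction" whenever it is separated from "film" by whitespace, which matches the three parsed queries above, while the whole-text variant matches what showTokens prints.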
