lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: QueryParser with CustomAnalyzer wrongly uses PatternReplaceCharFilter
Date Thu, 28 Apr 2016 12:28:24 GMT
Classic QueryParser splits on whitespace and then sends the chunks to the analyzer one at a
time.  See <https://issues.apache.org/jira/browse/LUCENE-2605>.
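
The effect of that splitting can be imitated without Lucene. The following is only a rough sketch: it uses plain java.util.regex in place of the PatternReplaceCharFilter, and a crude split on non-alphanumeric characters in place of the standard tokenizer. The class and method names (WhitespaceSplitDemo, analyze, parseLikeClassicQueryParser) are made up for illustration and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitespaceSplitDemo {
    // Same pattern as the char filter in the quoted question.
    private static final Pattern FILM = Pattern.compile("(movie|film|picture).*");

    // Stand-in for the analyzer (an assumption, not Lucene code):
    // apply the replacement first, then tokenize crudely.
    static List<String> analyze(String text) {
        String filtered = FILM.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the classic QueryParser behavior described above:
    // split on whitespace first, then analyze each chunk on its own.
    static List<String> parseLikeClassicQueryParser(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            tokens.addAll(analyze(chunk));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\"");
            System.out.println("  whole string: " + analyze(s));
            System.out.println("  per chunk:    " + parseLikeClassicQueryParser(s));
        }
    }
}
```

With the whole string, the pattern removes everything from "film" to the end of the input, so only "avatar" survives. With per-chunk analysis, "fiction" arrives as its own chunk with no "film" in front of it, so it passes through untouched, which is consistent with the parsed queries in the quoted output.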

--
Steve
www.lucidworks.com

> On Apr 28, 2016, at 5:54 AM, Bahaa Eldesouky <bahaabeih@gmail.com> wrote:
> 
> I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene
> 6.0.0 to parse queries with a CustomAnalyzer, as shown below:
> 
> public static void testFilmAnalyzer() throws IOException, ParseException {
>    CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
>            .addCharFilter("patternreplace",
>                    "pattern", "(movie|film|picture).*",
>                    "replacement", "")
>            .withTokenizer("standard")
>            .build();
> 
>    QueryParser qp = new QueryParser("name", nameAnalyzer);
>    qp.setDefaultOperator(QueryParser.Operator.AND);
>    String[] strs = {"avatar film fiction", "avatar-film fiction",
>            "avatar-film-fiction"};
> 
>    for (String str : strs) {
>        System.out.println("Analyzing \"" + str + "\":");
>        showTokens(str, nameAnalyzer);
>        Query q = qp.parse(str);
>        System.out.println("Parsed query of \"" + str + "\":");
>        System.out.println(q + "\n");
>    }
> }
> 
> private static void showTokens(String text, Analyzer analyzer)
>        throws IOException {
>    StringReader reader = new StringReader(text);
>    TokenStream stream = analyzer.tokenStream("name", reader);
>    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>    stream.reset();
>    while (stream.incrementToken()) {
>        System.out.print("[" + term.toString() + "]");
>    }
>    stream.end();   // perform end-of-stream operations before closing
>    stream.close();
>    System.out.println();
> }
> 
> I get the following output when I invoke testFilmAnalyzer():
> 
> Analyzing "avatar film fiction":
> [avatar]
> Parsed query of "avatar film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film fiction":
> [avatar]
> Parsed query of "avatar-film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film-fiction":
> [avatar]
> Parsed query of "avatar-film-fiction":
> name:avatar
> 
> 
> It seems like the analyzer uses the PatternReplaceCharFilter in its correct
> intended order (i.e. before tokenization), while the QueryParser does so
> afterwards. Does anyone have an explanation for that? Isn't that a bug?

