lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: QueryParser with CustomAnalyzer wrongly uses PatternReplaceCharFilter
Date Thu, 28 Apr 2016 12:37:01 GMT
Hi,

this is a general problem of using Analyzers in combination with QueryParser. Query parsing
is done *before* the text is analyzed: QueryParser uses a JavaCC grammar to parse the query,
and this involves some tokenization that is specific to the query syntax. Once the query parser
has analyzed the syntax, it sends each syntactic part through the analyzer separately
(unfortunately, for English text, these parts are only single whitespace-separated tokens).
Your CharFilter therefore never sees the whole query string, only one part at a time.
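
To make the order of operations concrete, here is a minimal plain-Java sketch without any
Lucene dependency. It is only an approximation under stated assumptions: analyze() is a rough
stand-in for the CustomAnalyzer from the original message (the regex replacement followed by a
simplistic split on non-alphanumeric characters, not the real StandardTokenizer), and
parseThenAnalyze() is a hypothetical stand-in for the parser splitting on whitespace before
handing each part to the analyzer. Both method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CharFilterOrderDemo {

    // Rough stand-in for the analyzer: apply the char-filter regex to the
    // text it is given, then split what is left into tokens.
    // Assumption: splitting on non-alphanumerics approximates the tokenizer.
    static List<String> analyze(String text) {
        String filtered = text.replaceAll("(movie|film|picture).*", "");
        List<String> tokens = new ArrayList<>();
        for (String token : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for the query parser: split on whitespace first,
    // then analyze each part on its own. The regex only ever sees one part.
    static List<String> parseThenAnalyze(String query) {
        List<String> terms = new ArrayList<>();
        for (String part : query.trim().split("\\s+")) {
            terms.addAll(analyze(part));
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] inputs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String input : inputs) {
            System.out.println("\"" + input + "\": analyzer=" + analyze(input)
                    + " parser=" + parseThenAnalyze(input));
        }
    }
}
```

With the three inputs from the original message, the whole-string path yields only "avatar"
each time, while the split-first path keeps "fiction" whenever it is separated from "film" by
whitespace. That is the same pattern as in the reported output.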

You have 2 possibilities:

- Move the pattern replacement into a TokenFilter. This is more likely to work with query parsing,
where the initial splitting is done by the parser. For your example a StopFilter would be a good
fit (it removes the tokens that appear in a given list).
- In many cases people use query parsing when it is not applicable. If your users only enter
terms and you don't need any query syntax, then query parsing is the wrong tool. What you
need instead is a simplified analysis process that just creates a query out of the tokens emitted
by the Analyzer. Lucene has the QueryBuilder class for that. QueryBuilder takes an Analyzer,
and you pass in a string that gets tokenized and converted into a query. You have the
option to create simple term queries combined in a BooleanQuery or, alternatively, to build a
phrase query. With this component, the whole Analyzer (including its CharFilters) is applied to
the complete input string, and the Analyzer's output is used to build the query, without any syntax.
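
A minimal sketch of the second option, assuming Lucene 6.x with the core and analyzers-common
modules on the classpath. The analyzer definition is copied from the original message. The
class name QueryBuilderSketch and the helper methods nameAnalyzer() and build() are made up for
illustration. Treat it as a starting point, not as a definitive implementation.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;

class QueryBuilderSketch {

    // Same analyzer as in the original message: the char filter runs on the
    // raw input before the standard tokenizer.
    static Analyzer nameAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
                .addCharFilter("patternreplace",
                        "pattern", "(movie|film|picture).*",
                        "replacement", "")
                .withTokenizer("standard")
                .build();
    }

    // Hypothetical helper: no query syntax is interpreted here. The complete
    // input string goes through the analyzer, and every emitted token becomes
    // a required clause. The result is null if the analyzer emits no tokens.
    static Query build(String userInput) throws IOException {
        QueryBuilder builder = new QueryBuilder(nameAnalyzer());
        return builder.createBooleanQuery("name", userInput, BooleanClause.Occur.MUST);
    }

    public static void main(String[] args) throws IOException {
        String[] inputs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String input : inputs) {
            System.out.println("\"" + input + "\" -> " + build(input));
        }
    }
}
```

Because the char filter now sees the full input, the resulting queries should agree with what
showTokens() prints for the same strings. For a phrase query, createPhraseQuery("name", input)
can be used in place of createBooleanQuery.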

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Bahaa Eldesouky [mailto:bahaabeih@gmail.com]
> Sent: Thursday, April 28, 2016 11:54 AM
> To: java-user@lucene.apache.org
> Subject: QueryParser with CustomAnalyzer wrongly uses
> PatternReplaceCharFilter
> 
> I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene
> 6.0.0 to parse queries using a CustomAnalyzer as shown below:
> 
> public static void testFilmAnalyzer() throws IOException, ParseException {
>     CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
>             .addCharFilter("patternreplace",
>                     "pattern", "(movie|film|picture).*",
>                     "replacement", "")
>             .withTokenizer("standard")
>             .build();
> 
>     QueryParser qp = new QueryParser("name", nameAnalyzer);
>     qp.setDefaultOperator(QueryParser.Operator.AND);
>     String[] strs = {"avatar film fiction", "avatar-film fiction",
>             "avatar-film-fiction"};
> 
>     for (String str : strs) {
>         System.out.println("Analyzing \"" + str + "\":");
>         showTokens(str, nameAnalyzer);
>         Query q = qp.parse(str);
>         System.out.println("Parsed query of \"" + str + "\":");
>         System.out.println(q + "\n");
>     }
> }
> 
> private static void showTokens(String text, Analyzer analyzer) throws IOException {
>     StringReader reader = new StringReader(text);
>     TokenStream stream = analyzer.tokenStream("name", reader);
>     CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>     stream.reset();
>     while (stream.incrementToken()) {
>         System.out.print("[" + term.toString() + "]");
>     }
>     stream.close();
>     System.out.println();
> }
> 
> I get the following output, when I invoke testFilmAnalyzer():
> 
> Analyzing "avatar film fiction":
> [avatar]
> Parsed query of "avatar film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film fiction":
> [avatar]
> Parsed query of "avatar-film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film-fiction":
> [avatar]
> Parsed query of "avatar-film-fiction":
> name:avatar
> 
> 
> It seems like the analyzer uses the PatternReplaceCharFilter in its correct
> intended order (i.e. before tokenization), while the QueryParser does so
> afterwards. Does anyone have an explanation for that? Isn't that a bug?

