lucene-java-user mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Wed, 22 Jul 2015 10:50:28 GMT
Yes, what I meant is that you can still use your analyser when the query
is not in quotes.
When it is in quotes, you can build a TermQuery directly from the quoted
content.
Of course the real scenario may not be that simple. Do you think quoted and
unquoted queries are two disjoint sets, i.e. a user asks either for a
quoted query or for a classic query, never both at once?
In that scenario it will be simple.

If the two can be mixed, we should take a closer look at the Lucene query
parser code and see how the tokenization of content within quotes is
handled.
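
For the mixed case, here is a minimal sketch (plain Java, no Lucene
dependency) of how the raw user input could be split into quoted and
unquoted parts before any Lucene query is built. QuerySegments, Segment and
split are hypothetical names for illustration, not Lucene API and not the
code used in this thread. The idea would be that each exact segment becomes
a TermQuery on the untokenized string, while the other segments go through
the analyzer / QueryParser as before.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits raw user input into
// quoted ("exact") and unquoted ("analyzed") segments.
class QuerySegments {

    /** One piece of the user input; exact is true when it was inside double quotes. */
    static final class Segment {
        final String text;
        final boolean exact;

        Segment(String text, boolean exact) {
            this.text = text;
            this.exact = exact;
        }

        @Override
        public String toString() {
            return (exact ? "EXACT:" : "ANALYZED:") + text;
        }
    }

    static List<Segment> split(String userCriteria) {
        List<Segment> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < userCriteria.length(); i++) {
            char c = userCriteria.charAt(i);
            if (c == '"') {
                // A double quote closes the current segment and toggles the mode.
                addIfNotBlank(segments, current, inQuotes);
                inQuotes = !inQuotes;
            } else {
                current.append(c);
            }
        }
        // Text after an unbalanced opening quote is treated as unquoted here.
        addIfNotBlank(segments, current, false);
        return segments;
    }

    private static void addIfNotBlank(List<Segment> out, StringBuilder sb, boolean exact) {
        String text = sb.toString().trim();
        if (!text.isEmpty()) {
            out.add(new Segment(text, exact));
        }
        sb.setLength(0);
    }

    public static void main(String[] args) {
        // Prints "EXACT:FD-A320-REC-SIM-1" and then "ANALYZED:wi-fi".
        for (Segment s : split("\"FD-A320-REC-SIM-1\" wi-fi")) {
            System.out.println(s);
        }
    }
}
```

Note that a TermQuery only matches if the index holds the untokenized value
as a single term. With the analyzer quoted below, that presumably means
relying on PRESERVE_ORIGINAL and lowercasing the exact segment first, since
LowerCaseFilter runs at index time; I have not verified this.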

Cheers

2015-07-22 11:32 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:

> Sorry, a small refactoring typo: curTokenProcessed should be
> userCriteriaProcessed.
>
> ...
>
> public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>
> ...
>
>   if (isExactCriteriaString(userCriteria)) {
>     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
>         escape(userCriteria.substring(1, userCriteria.length() - 1)));
>     userCriteriaProcessed = userCriteriaEscaped;
>   } else {
>     userCriteriaProcessed = escape(userCriteria);
>
>     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
>       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
>     }
>   }
>
>
>   String queryStr = "";
>
>   for (String fieldName : fields) {
>     String escapedFieldName = escape(fieldName);
>     queryStr += String.format("%s:%s ", escapedFieldName,
>         userCriteriaProcessed);
>   }
>
>   query = new QueryParser("", analyzer).parse(queryStr.trim());
>
> ...
>
> On Wed, Jul 22, 2015 at 12:27 PM, Diego Socaceti <socaceti@gmail.com>
> wrote:
>
> > Hi Alessandro,
> >
> > Sorry, I forgot the important part. Here it is:
> >
> > ...
> >
> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> >
> > ...
> >
> >   if (isExactCriteriaString(userCriteria)) {
> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> >     userCriteriaProcessed = userCriteriaEscaped;
> >   } else {
> >     userCriteriaProcessed = escape(userCriteria);
> >
> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> >     }
> >   }
> >
> >
> >   String queryStr = "";
> >
> >   for (String fieldName : fields) {
> >     String escapedFieldName = escape(fieldName);
> >     queryStr += String.format("%s:%s ", escapedFieldName,
> >         curTokenProcessed);
> >   }
> >
> >   query = new QueryParser("", analyzer).parse(queryStr.trim());
> >
> > ...
> >
> >
> > As far as I understand, my problem is that in my naive, query-syntax-based
> > solution I have to use my analyzer, which means that the userCriteria is
> > always tokenized.
> >
> > You suggest using the Java query classes to build the query, because then
> > I can control whether the userCriteria is tokenized or not.
> > Did I get you right?
> >
> >
> > Thanks and Kind regards
> >
> > On Wed, Jul 22, 2015 at 11:44 AM, Alessandro Benedetti <
> > benedetti.alex85@gmail.com> wrote:
> >
> >> I read it briefly, correct me if I am wrong, but that code only parses
> >> the content within the quotes (").
> >> We are still at the String level.
> >> I want to see how you build the PhraseQuery :)
> >> Looking at the code, PhraseQuery allows you to add as many terms as you
> >> want.
> >>
> >> What you need to do is not tokenise the content within the quotes and
> >> create a TermQuery instead (in your case you are not even using the
> >> position feature offered by the phrase query, you simply want to run a
> >> TermQuery).
> >>
> >> So, to clarify: you should parse the content within the quotes (as you
> >> are doing), then build a TermQuery out of that String, not tokenized at
> >> all.
> >>
> >> Does this make sense to you?
> >> Can I see what you do after identifying the content within the quotes?
> >>
> >> Cheers
> >>
> >>
> >> 2015-07-22 10:20 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:
> >>
> >> > Hi Alessandro,
> >> >
> >> > I guess code says more than words :)
> >> >
> >> > ...
> >> >
> >> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> >> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> >> >
> >> > ...
> >> >
> >> >   if (isExactCriteriaString(userCriteria)) {
> >> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> >> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> >> >     userCriteriaProcessed = userCriteriaEscaped;
> >> >   } else {
> >> >     userCriteriaProcessed = escape(userCriteria);
> >> >
> >> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> >> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> >> >     }
> >> >   }
> >> >
> >> > ...
> >> >
> >> > public static String escape(String s) {
> >> >   String result = s;
> >> >
> >> >   if (s != null && !s.trim().isEmpty()) {
> >> >     String toEscape = s.trim();
> >> >
> >> >     if (toEscape.contains("*")) {
> >> >       StringBuilder sb = new StringBuilder();
> >> >
> >> >       for (int i = 0; i < toEscape.length(); i++) {
> >> >         char curChar = toEscape.charAt(i);
> >> >         if (curChar == '*')
> >> >           sb.append('*');
> >> >         else
> >> >           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
> >> >       }
> >> >
> >> >       result = sb.toString();
> >> >     } else {
> >> >       result = QueryParser.escape(toEscape);
> >> >     }
> >> >   }
> >> >
> >> >   return result;
> >> > }
> >> >
> >> > ...
> >> >
> >> > Thanks and Kind regards
> >> >
> >> >
> >> >
> >> > On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
> >> > benedetti.alex85@gmail.com> wrote:
> >> >
> >> > > As a start, Diego, how do you currently parse the user query to build
> >> > > the Lucene queries?
> >> > >
> >> > > Cheers
> >> > >
> >> > > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:
> >> > >
> >> > > > Hi Alessandro,
> >> > > >
> >> > > > Yes, I want the user to be able to surround the query with "" to
> >> > > > run the phrase query with a NOT tokenized phrase.
> >> > > >
> >> > > > What do I have to do?
> >> > > >
> >> > > > Thanks and Kind regards
> >> > > >
> >> > > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
> >> > > > benedetti.alex85@gmail.com> wrote:
> >> > > >
> >> > > > > Hey Jack, reading the doc:
> >> > > > >
> >> > > > > "Set to true if phrase queries will be automatically generated
> >> > > > > when the analyzer returns more than one term from whitespace
> >> > > > > delimited text. NOTE: this behavior may not be suitable for all
> >> > > > > languages.
> >> > > > >
> >> > > > > Set to false if phrase queries should only be generated when
> >> > > > > surrounded by double quotes."
> >> > > > >
> >> > > > >
> >> > > > > In this user's case, I guess he's likely to use double quotes.
> >> > > > >
> >> > > > > The only problem he sees so far is that the phrase query uses the
> >> > > > > query-time analyser to split the quoted text into tokens.
> >> > > > >
> >> > > > > First we need feedback from him, but I guess he would like the
> >> > > > > phrase query not to tokenise the text within the double quotes.
> >> > > > >
> >> > > > > In that case we should find a way.
> >> > > > >
> >> > > > >
> >> > > > > Cheers
> >> > > > >
> >> > > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupansky@gmail.com>:
> >> > > > >
> >> > > > > > If you don't explicitly enable automatic phrase queries, the
> >> > > > > > Lucene query parser will assume an OR operator on the sub-terms
> >> > > > > > when a white space-delimited term analyzes into a sequence of
> >> > > > > > terms.
> >> > > > > >
> >> > > > > > See:
> >> > > > > >
> >> > > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> >> > > > > >
> >> > > > > >
> >> > > > > > -- Jack Krupansky
> >> > > > > >
> >> > > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi all,
> >> > > > > > >
> >> > > > > > > I'm new to Lucene and tried to write my own analyzer to support
> >> > > > > > > hyphenated words like wi-fi, jean-pierre, etc.
> >> > > > > > > For our customer it is important to find the word
> >> > > > > > > - wi-fi by wi, fi, wifi, wi-fi
> >> > > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > The analyzer:
> >> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> >> > > > > > >
> >> > > > > > >   protected NormalizeCharMap charConvertMap;
> >> > > > > > >
> >> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
> >> > > > > > >     initCharConvertMap();
> >> > > > > > >   }
> >> > > > > > >
> >> > > > > > >   // Strip double quotes from the input before tokenization.
> >> > > > > > >   protected void initCharConvertMap() {
> >> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> >> > > > > > >     builder.add("\"", "");
> >> > > > > > >     charConvertMap = builder.build();
> >> > > > > > >   }
> >> > > > > > >
> >> > > > > > >   @Override
> >> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> >> > > > > > >
> >> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
> >> > > > > > >
> >> > > > > > >     // Keep the original token and also emit its parts and the
> >> > > > > > >     // catenated word parts.
> >> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
> >> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> >> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> >> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> >> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
> >> > > > > > >         null);
> >> > > > > > >     tok = new LowerCaseFilter(tok);
> >> > > > > > >     tok = new LengthFilter(tok, 1, 255);
> >> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> >> > > > > > >
> >> > > > > > >     return new TokenStreamComponents(src, tok);
> >> > > > > > >   }
> >> > > > > > >
> >> > > > > > >   @Override
> >> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
> >> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
> >> > > > > > >   }
> >> > > > > > > }
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > The analyzer seems to work except for exact phrase match
> >> > > > > > > queries.
> >> > > > > > >
> >> > > > > > > E.g. the following words are indexed:
> >> > > > > > >
> >> > > > > > > FD-A320-REC-SIM-1
> >> > > > > > > FD-A320-REC-SIM-10
> >> > > > > > > FD-A320-REC-SIM-11
> >> > > > > > > MIA-FD-A320-REC-SIM-1
> >> > > > > > > SIN-FD-A320-REC-SIM-1
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
> >> > > > > > > FD-A320-REC-SIM-1
> >> > > > > > > MIA-FD-A320-REC-SIM-1
> >> > > > > > > SIN-FD-A320-REC-SIM-1
> >> > > > > > >
> >> > > > > > > For our customer this is wrong, because this exact phrase match
> >> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
> >> > > > > > >
> >> > > > > > > Do you have any ideas or tips on how we have to change our
> >> > > > > > > current analyzer to support this requirement?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Thanks and Kind regards
> >> > > > > > > Diego
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > --------------------------
> >> > > > >
> >> > > > > Benedetti Alessandro
> >> > > > > Visiting card - http://about.me/alessandro_benedetti
> >> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> >> > > > >
> >> > > > > "Tyger, tyger burning bright
> >> > > > > In the forests of the night,
> >> > > > > What immortal hand or eye
> >> > > > > Could frame thy fearful symmetry?"
> >> > > > >
> >> > > > > William Blake - Songs of Experience -1794 England
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >>
> >>
> >>
> >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
