lucene-java-user mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Analyzer for supporting hyphenated words
Date Wed, 22 Jul 2015 09:44:17 GMT
I read it briefly, so correct me if I am wrong, but that code only parses the
content within the double quotes.
We are still at the String level there.
I want to see how you build the PhraseQuery :)
Looking at the code, PhraseQuery allows you to add as many terms as
you want.

What you need to do is leave the content within the quotes untokenised and
create a TermQuery from it (in your case you are not even using the
position feature that PhraseQuery offers, you simply want to
run a TermQuery).

So, to clarify: parse the content within the quotes (as you are
doing), then build a TermQuery out of that String, not tokenized at all.
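
A minimal sketch of that string-level step, in plain Java so it runs without Lucene. The class and method names (ExactCriteria, isExact, exactTermText) are hypothetical, not Lucene API. The lowercasing is an assumption based on the LowerCaseFilter in your analyzer: the term text has to match what was indexed.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not part of Lucene): extracts the untokenized term text. */
final class ExactCriteria {

    private ExactCriteria() {}

    /** True when the criterion is wrapped in double quotes, e.g. "FD-A320-REC-SIM-1". */
    static boolean isExact(String userCriteria) {
        return userCriteria != null
                && userCriteria.length() >= 2
                && userCriteria.startsWith("\"")
                && userCriteria.endsWith("\"");
    }

    /**
     * Returns the text for a single, untokenized term, or empty when the
     * criterion is not quoted or has nothing between the quotes.
     */
    static Optional<String> exactTermText(String userCriteria) {
        if (!isExact(userCriteria)) {
            return Optional.empty();
        }
        String inner = userCriteria.substring(1, userCriteria.length() - 1).trim();
        if (inner.isEmpty()) {
            return Optional.empty();
        }
        // Assumption: index-time analysis lowercases tokens, so the term must too.
        return Optional.of(inner.toLowerCase(Locale.ROOT));
    }
}
```

With Lucene on the classpath, the returned text would go into new TermQuery(new Term(fieldName, text)), where fieldName stands for whatever field you search. Given PRESERVE_ORIGINAL in your analyzer, the whole lowercased token should be in the index, so the query for "FD-A320-REC-SIM-1" should no longer match MIA-FD-A320-REC-SIM-1. Note this only works for values without whitespace, because the WhitespaceTokenizer splits on whitespace at index time.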

Does this make sense to you?
Can I see what you do after identifying the content within the quotes?

Cheers


2015-07-22 10:20 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:

> Hi Alessandro,
>
> I guess code says more than words :)
>
> ...
>
> public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>
> ...
>
>   if (isExactCriteriaString(userCriteria)) {
>     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
>         escape(userCriteria.substring(1, userCriteria.length() - 1)));
>     userCriteriaProcessed = userCriteriaEscaped;
>   } else {
>     userCriteriaProcessed = escape(userCriteria);
>
>     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
>       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
>     }
>   }
>
> ...
>
> public static String escape(String s) {
>   String result = s;
>
>   if (s != null && !s.trim().isEmpty()) {
>     String toEscape = s.trim();
>
>     if (toEscape.contains("*")) {
>       StringBuilder sb = new StringBuilder();
>
>       for (int i = 0; i < toEscape.length(); i++) {
>         char curChar = toEscape.charAt(i);
>         if (curChar == '*')
>           sb.append('*');
>         else
>           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
>       }
>
>       result = sb.toString();
>     } else {
>       result = QueryParser.escape(toEscape);
>     }
>   }
>
>   return result;
> }
>
> ...
>
> Thanks and Kind regards
>
>
>
> On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > As a start Diego, how do you currently parse the user query to build the
> > Lucene queries ?
> >
> > Cheers
> >
> > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <socaceti@gmail.com>:
> >
> > > Hi Alessandro,
> > >
> > > yes, I want the user to be able to surround the query with "" to run the
> > > phrase query with a NOT tokenized phrase.
> > >
> > > What do I have to do?
> > >
> > > Thanks and Kind regards
> > >
> > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
> > > benedetti.alex85@gmail.com> wrote:
> > >
> > > > Hey Jack, reading the doc :
> > > >
> > > > "Set to true if phrase queries will be automatically generated when the
> > > > analyzer returns more than one term from whitespace delimited text. NOTE:
> > > > this behavior may not be suitable for all languages.
> > > >
> > > > Set to false if phrase queries should only be generated when surrounded by
> > > > double quotes."
> > > >
> > > >
> > > > In this user's case, I guess he is likely to use double quotes.
> > > >
> > > > The only problem he sees so far is that the phrase query uses the
> > > > query-time analyser to split the text into tokens.
> > > >
> > > > First we need feedback from him, but I guess he would like the
> > > > phrase query not to tokenise the text within the double quotes.
> > > >
> > > > In that case we should find a way.
> > > >
> > > >
> > > > Cheers
> > > >
> > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupansky@gmail.com>:
> > > >
> > > > > If you don't explicitly enable automatic phrase queries, the Lucene query
> > > > > parser will assume an OR operator on the sub-terms when a white
> > > > > space-delimited term analyzes into a sequence of terms.
> > > > >
> > > > > See:
> > > > >
> > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> > > > >
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socaceti@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > I'm new to Lucene and tried to write my own analyzer to support
> > > > > > hyphenated words like wi-fi, jean-pierre, etc.
> > > > > > For our customer it is important to find the word
> > > > > > - wi-fi by wi, fi, wifi, wi-fi
> > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > The analyzer:
> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > > > > > >
> > > > > > >   protected NormalizeCharMap charConvertMap;
> > > > > > >
> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
> > > > > > >     initCharConvertMap();
> > > > > > >   }
> > > > > > >
> > > > > > >   // Strip double quotes from the input before tokenization.
> > > > > > >   protected void initCharConvertMap() {
> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > > > > > >     builder.add("\"", "");
> > > > > > >     charConvertMap = builder.build();
> > > > > > >   }
> > > > > > >
> > > > > > >   @Override
> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > > > > > >
> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
> > > > > > >
> > > > > > >     // Keep the original token, split it into word and number parts,
> > > > > > >     // and also emit the concatenated word parts (wi-fi -> wifi).
> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
> > > > > > >         null);
> > > > > > >     tok = new LowerCaseFilter(tok);
> > > > > > >     // Drop tokens shorter than 1 or longer than 255 characters.
> > > > > > >     tok = new LengthFilter(tok, 1, 255);
> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > > > > > >
> > > > > > >     return new TokenStreamComponents(src, tok);
> > > > > > >   }
> > > > > > >
> > > > > > >   @Override
> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
> > > > > > >   }
> > > > > > > }
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > The analyzer seems to work except for exact phrase match queries.
> > > > > >
> > > > > > e.g. the following words are indexed
> > > > > >
> > > > > > FD-A320-REC-SIM-1
> > > > > > FD-A320-REC-SIM-10
> > > > > > FD-A320-REC-SIM-11
> > > > > > MIA-FD-A320-REC-SIM-1
> > > > > > SIN-FD-A320-REC-SIM-1
> > > > > >
> > > > > >
> > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
> > > > > > FD-A320-REC-SIM-1
> > > > > > MIA-FD-A320-REC-SIM-1
> > > > > > SIN-FD-A320-REC-SIM-1
> > > > > >
> > > > > > > For our customer this is wrong, because this exact phrase match
> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
> > > > > > >
> > > > > > > Do you have any ideas or tips on how we should change our current
> > > > > > > analyzer to support this requirement?
> > > > > >
> > > > > >
> > > > > > Thanks and Kind regards
> > > > > > Diego
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card - http://about.me/alessandro_benedetti
> > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> > > >
> > >
> >
> >
>



