lucene-java-user mailing list archives

From Hankyu Kim <gksr...@gmail.com>
Subject Re: Query beginning with special characters
Date Mon, 14 Jan 2013 12:04:21 GMT
I just found the cause of the error, and you were right about my code being the
source. I used "Character.getNumericValue(termBuffer[0]) == -1" to test whether
termBuffer[0] was the null character, but apparently special characters return
-1 as well when passed to that method.
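For anyone hitting the same trap, here is a minimal sketch (plain Java, no Lucene involved; class name is my own) of why "== -1" cannot double as a null test: Character.getNumericValue returns -1 for any character without a numeric value, which covers the null character and punctuation alike. Comparing against '\0' directly distinguishes the two cases.

```java
public class NumericValueCheck {
    public static void main(String[] args) {
        // '\0', ':' and ')' all map to -1, so "== -1" does not
        // distinguish an empty buffer slot from a special character.
        char[] samples = { '\0', ':', ')', 'a', '7' };
        for (char c : samples) {
            System.out.println((int) c + " -> getNumericValue = "
                    + Character.getNumericValue(c));
        }
        // Testing for the null character explicitly avoids the ambiguity:
        char first = ':';
        boolean isEmptySlot = (first == '\0');   // false, as intended
        System.out.println("isEmptySlot = " + isEmptySlot);
    }
}
```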

Thank you for your help.
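For reference, a self-contained sketch (my own illustration, independent of the custom tokenizer quoted below) that reproduces the intended trigram stream for the sample sentence from the earlier mail: lowercase the text, drop whitespace, then emit every overlapping three-character window.

```java
import java.util.ArrayList;
import java.util.List;

public class TrigramSketch {
    // Build overlapping character trigrams, skipping whitespace,
    // mirroring the tokenizer's intended output.
    static List<String> trigrams(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase().toCharArray()) {
            if (!Character.isWhitespace(c)) sb.append(c);
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= sb.length(); i++) {
            grams.add(sb.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Produces 'sam', 'amp', ..., 's:)', ':)a', ')an', ..., 'uch'
        System.out.println(trigrams("Sample text with special characters :) and such"));
    }
}
```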

2013/1/14 Hankyu Kim <gksrb92@gmail.com>

> I did intend to ignore all the spaces, so that's not the problem.
>
> Here's the tokenization chain in my customAnalyser class, which extends Analyzer:
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         NGramTokenizer src = new NGramTokenizer(matchVersion, reader); // My NGramTokenizer
>
>         TokenStream tok = new LowerCaseFilter(matchVersion, src);
>         return new TokenStreamComponents(src, tok);
>     }
>
> NGramTokenizer's incrementToken() method.
>     @Override
>     public boolean incrementToken() throws IOException
>     {
>         clearAttributes();
>         char[] termBuffer = termAtt.buffer();
>         termAtt.setLength(GRAM_SIZE);
>
>         startOffset++;                            // Values for offset attribute
>         offsetAtt.setOffset(startOffset, startOffset + GRAM_SIZE - 1);
>
>         do
>         {
>             termBuffer[0] = termBuffer[1];            // Shift characters to left
>             termBuffer[1] = termBuffer[2];
>
>             // Get next non-whitespace character
>             int c = ' ';
>             while(Character.isWhitespace(c))
>             {
>                 if(position >= dataLength) // Read in buffer, if position gets out of bound
>                 {
>                     if(charUtils.fill(iobuffer, input))
>                     {
>                         dataLength = iobuffer.getLength();
>                         position = 0;
>                     }
>                     else    // EOF
>                         return false;
>                 }
>
>                 c = charUtils.codePointAt(iobuffer.getBuffer(), position);    // Get next character
>                 position++;
>             }
>
>             Character.toChars(c, termBuffer, GRAM_SIZE-1);
>             // System.out.print("'"+termBuffer[0]+termBuffer[1]+termBuffer[2]+"', "); // This is how I got the output in the last email
>
>         }
>         while(Character.getNumericValue(termBuffer[0]) == -1);
>
>         return true;
>
>     }
>
> 2013/1/14 Ian Lea <ian.lea@gmail.com>
>
>> In fact I see you are ignoring all spaces between words.  Maybe that's
>> deliberate.  Break it down into the smallest possible complete code
>> sample that shows the problem and post that.
>>
>>
>> --
>> Ian.
>>
>>
>> On Mon, Jan 14, 2013 at 11:02 AM, Ian Lea <ian.lea@gmail.com> wrote:
>> > It won't be IndexWriter or IndexWriterConfig.  What exactly does your
>> > analyzer do - what is the full chain of tokenization?  Are you saying
>> > that  ':)a' and ')an' are not indexed?  Surely that is correct given
>> > your input with a space after the :).  And before as well, so 's:)' is
>> > also suspect.
>> >
>> > --
>> > Ian.
>> >
>> >
>> > On Mon, Jan 14, 2013 at 7:42 AM, Hankyu Kim <gksrb92@gmail.com> wrote:
>> >> I'm working with Lucene 4.0 and I didn't use Lucene's QueryParser, so
>> >> setAllowLeadingWildcard() is irrelevant.
>> >> I also realised the issue wasn't with querying, but with indexing,
>> >> which left out the terms with a leading special character.
>> >>
>> >> My goal was to do a fuzzy match by creating a trigram index. The idea
>> >> is to tokenize the documents into trigrams rather than words during
>> >> indexing and searching, so Lucene can search for part of a word or
>> >> phrase.
>> >>
>> >> Say the original text in the document said: "Sample text with special
>> >> characters :) and such"
>> >> It's tokenized into
>> >>  'sam', 'amp', 'mpl', 'ple', 'let', 'ete', 'tex', 'ext', 'xtw', 'twi',
>> >> 'wit', 'ith', 'ths', 'hsp', 'spe', 'pec', 'eci', 'cia', 'ial', 'alc',
>> >> 'lch', 'cha', 'har', 'ara', 'rac', 'act', 'cte', 'ter', 'ers', 'rs:',
>> >> 's:)', ':)a', ')an', 'and', 'nds', 'dsu', 'suc', 'uch'.
>> >> The above is output from my tokenizer, so there's nothing wrong with
>> >> creating trigrams. However, when I check the index with lukeall, all
>> >> the other trigrams are indexed correctly except for the terms ':)a'
>> >> and ')an'. Since the missing terms start with Lucene's special
>> >> characters, I don't think it's got to do with my custom code.
>> >>
>> >> I only changed the analyzer in the demo's IndexFiles.java to index
>> >> the file. Honestly, I can't locate even the exact class in which the
>> >> problem is caused. I'm only guessing IndexWriterConfig or IndexWriter
>> >> is discarding the terms with leading special characters.
>> >>
>> >> I hope the above information helps.
>> >>
>> >> 2013/1/11 Ian Lea <ian.lea@gmail.com>
>> >>
>> >>> QueryParser has a setAllowLeadingWildcard() method.  Could that be
>> >>> relevant?
>> >>>
>> >>> What version of lucene?  Can you post some simple examples of what
>> >>> does/doesn't work? Post the smallest possible, but complete, code that
>> >>> demonstrates the problem?
>> >>>
>> >>>
>> >>> With any question that mentions a custom version of something, that
>> >>> custom version has to be the prime suspect for any problems.
>> >>>
>> >>>
>> >>> --
>> >>> Ian.
>> >>>
>> >>>
>> >>> On Thu, Jan 10, 2013 at 12:08 PM, Hankyu Kim <gksrb92@gmail.com>
>> wrote:
>> >>> > Hi.
>> >>> >
>> >>> > I've created a custom analyzer that treats special characters just
>> >>> > like any other. The index works fine all the time, even when the
>> >>> > query includes special characters, except when the special
>> >>> > characters come at the beginning of the query.
>> >>> >
>> >>> > I'm using SpanTermQuery and WildcardQuery, and they both seem to
>> >>> > suffer the same issue with queries beginning with special
>> >>> > characters. Is it a limitation of Lucene or am I missing something?
>> >>> >
>> >>> > Thanks
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>>
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
