lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Strip out punctuation at the end of token
Date Fri, 24 Nov 2017 18:35:50 GMT
You need to play with the (many) parameters for WordDelimiterFilterFactory.

For instance, you have preserveOriginal set to 1. That's what's
generating the token with the dot.

You have catenateAll and catenateNumbers set to zero. That means that
someone searching for 61149008 won't get a hit.

The fact that the dot is in the tokens generated doesn't really matter
as long as the query tokens produced will match.

I think you're getting a bit off track by focusing on the hyphen and
dot. You're only seeing them in the index at all because you have
preserveOriginal set to 1. Let's say that you set preserveOriginal to
0 and catenateNumbers to 1. Then you'd get:
61149
008
61149008

in your index. No dots, no hyphens.

Now say your _query_ analysis also has catenateNumbers as 1 and
preserveOriginal as 0. The user searches for
61149-008

and the emitted tokens are in the index and you're OK. The user
searches for 61149008 and gets a hit there too. The dot is irrelevant.
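
A minimal sketch of a field type with the settings described above, to
make the index/query symmetry concrete. The field type name `text_ids`,
the whitespace tokenizer, and the lowercase filter are assumptions; the
original schema is not shown in this message. Only the
WordDelimiterFilterFactory attributes come from the discussion.

```xml
<!-- Hypothetical field type; name and tokenizer are illustrative assumptions. -->
<fieldType name="text_ids" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenizer keeps "61149-008." as one token for WDFF to split. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same WDFF settings at query time, so 61149-008 and 61149008 both match. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Running "61149-008." through the Analysis page of the admin UI with a
chain like this is the quickest way to confirm which tokens are emitted
on each side.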

Now, all that said, if that isn't comfortable you could certainly add
PatternReplaceFilterFactory. But really, WDFF is designed for this kind
of thing. I think you'll be just fine if you play with the options
enough to understand the nuances, which can be tricky, I'll admit.
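
For the PatternReplaceFilterFactory route, a sketch of an index-time
chain might look like the following. Placing the filter before WDFF, the
whitespace tokenizer, and the regex itself are assumptions rather than
anything taken from the poster's schema. Note that it strips a trailing
dot from every token, not only from numbers.

```xml
<!-- Illustrative index-time chain; tokenizer and pattern are assumptions. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Remove a single trailing dot, so "61149-008." becomes "61149-008". -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="\.$"
          replacement=""
          replace="all"/>
  <!-- With preserveOriginal="1", the preserved token no longer carries the dot. -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1"
          generateNumberParts="1"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="1"/>
</analyzer>
```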


Best,
Erick

On Fri, Nov 24, 2017 at 7:13 AM, Sergio García Maroto
<marotosg@gmail.com> wrote:
> Yes. You are right. I understand now.
> Let me explain my issue a bit better with the exact problem I have.
>
> I have this text "Information number  61149-008."
> Using the tokenizers and filters described previously I get this list of
> tokens.
> information
> number
> 61149-008.
> 61149
> 008
>
> Basically the last token "61149-008." gets tokenized as
> 61149-008.
> 61149
> 008
> User is searching for "61149-008" without dot, so this is not a match.
> I don't want to change the tokenization on the query to avoid altering the
> matches for other cases.
>
> I would like to delete the dot at the end. Basically generate this extra
> token
> information
> number
> 61149-008.
> 61149
> 008
> 61149-008
>
> Not sure if what I am saying makes sense or if there is another way to do
> this right.
>
> Thanks a lot
> Sergio
>
>
> On 24 November 2017 at 15:31, Shawn Heisey <apache@elyograg.org> wrote:
>
>> On 11/24/2017 2:32 AM, marotosg wrote:
>>
>>> Hi Shawn.
>>> Thanks for your reply. Actually my issue is with the last token. It looks
>>> like, for the last token of a string, it keeps the dot.
>>>
>>> In your case Testing. This is a test. Test.
>>>
>>> Keeps the "Test."
>>>
>>> Is there any reason I can't see for that behaviour?
>>>
>>
>> I am really not sure what you're saying here.
>>
>> Every token is duplicated, one has the dot and one doesn't.  This is what
>> you wanted based on what I read in your initial email.
>>
>> Making a guess as to what you're asking about this time: If you're
>> noticing that there isn't a "Test" as the last token on the line for WDF,
>> then I have to tell you that it actually is there, the display was simply
>> too wide for the browser window. Scrolling horizontally would be required
>> to see the whole thing.
>>
>> Thanks,
>> Shawn
>>
>>
