lucene-solr-user mailing list archives

From Teresa McMains <ter...@t14-consulting.com>
Subject RE: match string fields with embedded hyphens
Date Mon, 06 Apr 2020 16:38:12 GMT
Erick, thank you so much for this.  I'm going to try to implement with PatternReplaceCharFilterFactory
as you recommended.
What you mentioned about re-indexing from an empty state made sense to me (in terms of the
observed behavior) but also surprised me.  If I select "Clean" on the reindex, does it *not*
start from an empty state?

Thanks!!
Teresa


-----Original Message-----
From: Erick Erickson <erickerickson@gmail.com> 
Sent: Friday, April 3, 2020 7:16 PM
To: solr-user@lucene.apache.org
Subject: Re: match string fields with embedded hyphens

First, thanks for taking the time to write up a clear problem statement. Putting in the field
type is _really_ helpful.

By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. The problem
is that wildcards are tricky, and this trips everybody up at one time or another.

The quick background is that if there’s any possibility that the filter can produce multiple
tokens for a single input token, that filter is skipped when a wildcard term is analyzed at _query_ time. Imagine
that your replacement was a space rather than an empty string. Then 123--456 would become
_two_ tokens in subsequent processing. Now anything you do is wrong sometime, somewhere. 

For instance, 123*456 would fail because it’d be looking for one token, which you wouldn’t
expect. 12345* would also fail because there’s no single token like that. 123 would succeed
(note no wildcard). You can see where this is going.

Which doesn’t help you solve your use-case. There are several options:

- use <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]"
replacement=""/> instead of PatternReplaceFilterFactory. charFilters are applied to the
raw input before analysis and don’t have the same problem with producing multiple tokens.

- WordDelimiter(Graph)FilterFactory is built for this kind of thing. There are a number of
options, and this is one of the few filters that’s often different between index and query
analysis chains. It can be tricky to understand all the interactions of the parameters though.
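
As a rough sketch of the first option, the field type from the original question might look something like the following with the charFilter swapped in. This is untested and assumes the rest of the definition stays as posted; note that charFilter elements go before the tokenizer inside the analyzer.

```xml
<!-- Sketch only: the TrimmedString type from the question, with the
     token filter replaced by a charFilter. Untested assumption, not a
     verified configuration. -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Runs on the raw input, before tokenization: strips anything
         that isn't a letter or digit (both cases must be listed here,
         because upper-casing has not happened yet) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <!-- Keeps the whole cleaned value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Normalizes token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With something like this, 1234-56AB45 should be indexed as the single token 123456AB45, so it is worth checking both the index and query sides on the admin UI's Analysis page before re-indexing.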

And as an aside, I don’t know how large your index is, but wildcards with only one or two leading
characters, e.g. 1* or 12*, can get very expensive. If you can require 3
or more leading characters there are rarely problems. You can also do a time/space tradeoff
by including EdgeNGramFilterFactory in your chain at the cost of a larger index.
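
A minimal sketch of that tradeoff, assuming a separate field type whose name (TrimmedStringPrefix) and gram sizes are purely illustrative: prefixes are generated at index time only, so a plain query such as 123456 can match without a wildcard.

```xml
<!-- Sketch only: hypothetical field type name and illustrative gram
     sizes; adjust to the real length of the IDs. -->
<fieldType name="TrimmedStringPrefix" class="solr.TextField" omitNorms="true">
  <!-- Index side: store every leading prefix of 3 to 20 characters -->
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <!-- Query side: same clean-up, but no n-grams, so the user's input
       is matched against the stored prefixes as a single token -->
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One caveat with this layout: an ID longer than maxGramSize would not be indexed in full, so the upper bound needs to cover the longest value you expect.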

And finally, (and this is a total nit) there’s no reason to specify lower-case characters
in your existing pattern because the upper-case filter is first. You _will_ have to specify
uppercase characters if you use the charfilter.

As for why production is different than QA, my guess is that you overlaid the schema changes
on an _existing_ index. Most of the time, to get consistent results, you must re-index everything
starting from an _empty_ index. This is a long and complicated explanation that I won’t
go into here. In fact, I usually do one of two things:

1> define a new collection/core and index to that. If using SolrCloud, you can re-index
and use collection aliasing to seamlessly switch.

2> stop Solr. Delete all the datadirs (the parent of tlog and index) associated with any
of my replicas, restart Solr and index. You may be able to get away with using delete-by-query
to remove everything in your index then optimize (one of the very few times I’ll recommend
optimizing), reloading your collection and indexing. The point is to get rid of all traces
of anything generated from the old schema. 
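
For the delete-by-query route, the update messages would look roughly like this. It is a sketch that assumes each message is posted separately to the collection's update handler, followed by a reload and a full re-index.

```xml
<!-- Sketch only: three separate XML update messages, each sent on its
     own to the update handler of the collection being cleared. -->

<!-- 1. Delete every document in the index -->
<delete><query>*:*</query></delete>

<!-- 2. Make the deletion visible -->
<commit/>

<!-- 3. Merge segments so the deleted documents are physically purged -->
<optimize/>
```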

Best,
Erick

> On Apr 3, 2020, at 3:40 PM, Teresa McMains <teresa@t14-consulting.com> wrote:
> 
> Forgive me if this is unclear, I am very much new here.
> 
> I am working with a customer who needs to be able to query various account/customer ID
> fields which may or may not have embedded dashes.  But they want to be able to search by entering
> the dashes or not and by entering partial values or not.
> 
> So we may have an account or customer ID like
> 
> 1234-56AB45
> 
> And they would like to retrieve this by searching for any of the following:
> 1234-56AB45     (full string match)
> 1234-56                (partial string match)
> 123456AB45        (full string but no dashes)
> 123456                  (partial string no dashes)
> 
> I've defined this field type in schema.xml as:
> 
> 
> <!-- String replace field for account number searches -->
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <!-- Normalizes token text to upper case -->
>     <filter class="solr.UpperCaseFilterFactory" />
>     <!-- Removes anything that isn't a letter or digit -->
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
>   </analyzer>
> </fieldType>
> 
> But the behavior I see is completely unexpected:
> - Full string match works fine on the customer's DEV environment but not
>   in QA (which is running the same version of SOLR).
> - Partial string match works for some ID fields but not others.
> - A partial string match when the user does not enter the dashes just never works.
> 
> I don't even know where to begin.  The behavior is not consistent enough to give me a
> sense.
> 
> So perhaps I will just ask - how would you define a fieldType which should ignore special
> characters like hyphens or underscores (or anything non-alphanumeric) and works for full string
> or partial string search?
> 
> Thank you.
> 
> 
