lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: match string fields with embedded hyphens
Date Fri, 03 Apr 2020 23:23:35 GMT

: I am working with a customer who needs to be able to query various 
: account/customer ID fields which may or may not have embedded dashes.  
: But they want to be able to search by entering the dashes or not and by 
: entering partial values or not.
: 
: So we may have an account or customer ID like
: 
: 1234-56AB45
: 
: And they would like to retrieve this by searching for any of the following:
: 1234-56AB45     (full string match)
: 1234-56                (partial string match)
: 123456AB45        (full string but no dashes)
: 123456                  (partial string no dashes)

To answer your last question first...

: So perhaps I will just ask - how would you define a fieldType which 
: should ignore special characters like hyphens or underscores (or 
: anything non-alphanumeric) and works for full string or partial string 
: search?

This is pretty much exactly what the "Word Delimiter Filter" was designed 
for, and I encourage you to play with it and its various options and 
see what happens...

https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#word-delimiter-graph-filter

You'll definitely need to enable some "non-default" options (like 
"catenateNumbers=true") to ensure that you get indexed terms like 
"123456" from input "1234-56AB45".
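
As a starting point for that experimentation, a fieldType along these 
lines is one possibility. This is only a sketch: the name "text_id" is 
made up, and the tokenizer and option values are assumptions rather than 
a tested recommendation. Check the terms it actually produces for your 
IDs on the Analysis screen of the Solr admin UI before relying on it.

```xml
<!-- Hypothetical fieldType: the name "text_id" and every option value
     below are assumptions to experiment with, not a verified setup. -->
<fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateNumbers is the non-default option mentioned above;
         catenateAll and preserveOriginal are extra guesses aimed at the
         "full string, no dashes" and "full string" cases. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"/>
    <!-- Graph filters need to be flattened at index time. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Catenation is left off on the query side here; whether that is
         right for your queries is something to test. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Remember that any change to the index analyzer only takes effect for 
documents indexed after the change, so you need to reindex to compare 
variations fairly.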

One thing that's not entirely clear from your question & input is how you 
define "partial string" ... for example: are you expecting a query of "12" 
to match your input document? Because WDF won't help with that.

: But the behavior I see is completely unexpected. Full string match works 
: fine on the customer's DEV environment but not in QA (which is running 
: the same version of SOLR)

I guarantee you there is some difference between your DEV and QA 
environments.  Either in terms of the documents in the index, or the 
schema THAT WAS USED WHEN INDEXING THE DOCS (which might have been 
changed after the indexing happened), or the "current" schema being 
used when the queries are getting parsed, or the default request 
options in solrconfig.xml ... something is absolutely different.

: Partial string match works for some ID fields but not others
: A Partial string match when the user does not enter the dashes just never works

I'm assuming these last 2 comments refer to behavior you see on *both* 
your DEV and QA instances?

Depending on your definition of "partial string" (see the question I asked 
above) then I _think_ the analyzer you have should work, at least for 
all the examples you've provided.

The missing piece of information is *how* you are querying: what query 
parser you are using, what exactly the input looks like; and also the 
output: what does "never works" mean? ... does it match 0 docs? does it 
match docs you don't expect?

It would help to see the exact request URLs you are trying, with 
"debug=true&echoParams=all" added, and the full output of those requests, 
so we can see things like the header (to confirm what default params 
might be getting added), the query parser debug info (to double check 
how your query is being parsed), and the "explain" info (to see why any 
docs that match unexpectedly are matching).

More tips on details that can be useful to include to "help us help 
you"...

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

-Hoss
http://www.lucidworks.com/
