lucene-solr-user mailing list archives

From "Oakley, Craig (NIH/NLM/NCBI) [C]" <craig.oak...@nih.gov>
Subject change in White Space when upgrading 6.6 to 7.4
Date Fri, 01 Feb 2019 16:55:16 GMT
We had a problem when upgrading from Solr 6.6 to Solr 7.4 in that a query ceased to work.


The query was of the form http://localhost:8983/solr/collection/select?indent=on&q=ABC4856.21%20AND%20-field1:ABC4856.21&wt=json&rows=0

Basically finding a count of those records where there is some field which has "ABC4856.21",
but where the field field1 does not have that string (in other words, where there is some
field other than field1 which has "ABC4856.21")

For this particular collection, running the query against Solr 6.6 resulted in "response":{"numFound":0
(which was correct), but running it against Solr 7.4 resulted in "response":{"numFound":21322074

After some investigation, it seemed to be a problem with the initial "ABC4856.21" being tokenized
as "ABC4856" and "21"

We found various work-arounds such as putting quotation marks around the string or adding
"*:" after the "q="; but the user wanted the exact same query to work in Solr 7.4 as it had
in Solr 6.6

Eventually, we found a solution by adding "<str name="sow">true</str>" to the
Select handler in solrconfig.xml (for "Separate On Whitespace").
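
For reference, a minimal sketch of what that change looks like. The handler name "/select", the solr.SearchHandler class, and the placement inside the "defaults" list are assumptions based on a stock solrconfig.xml, not a copy of our actual file; only the "sow" line is the change described above.

```xml
<!-- Sketch only: handler name, class and "defaults" placement are assumed
     from a stock solrconfig.xml; other existing defaults are omitted. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- sow = "Split On Whitespace": the query parser splits the query text
         on whitespace before field analysis, as it did by default in 6.x -->
    <str name="sow">true</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed on an individual request by appending "&sow=true" to the query URL, which is a quick way to check its effect before changing the handler defaults.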

This solution seems to be sufficient; but we would like to be sure we understand the solution.

Looking at lucene.apache.org/solr/guide/7_4/tokenizers.html#standard-tokenizer it would seem
that the period should not split the string into two tokens.

Could someone clarify how we can know which Tokenizer is used when, and which definition of
White Space is used when?

Thanks
