lucene-solr-user mailing list archives

From "Allison, Timothy B." <>
Subject RE: Solr Wildcard Search
Date Thu, 30 Nov 2017 14:50:17 GMT
A slightly more refined answer...  In my experience with the systems I've worked with, Porter
and other stemmers can be useful as a "fallback field" with a really low boost, but you should
be really careful if you're only searching on one field.
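
As a rough sketch of the "fallback field" idea, assuming the edismax parser and two hypothetical fields (name_exact, analyzed without a stemmer, and name_stemmed, analyzed with Porter), the boosts could be set in solrconfig.xml along these lines. The handler settings, field names and boost values are illustrative assumptions, not settings taken from this thread:

```xml
<!-- solrconfig.xml: hypothetical handler defaults; field names and boosts are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- the unstemmed field dominates scoring; the stemmed field is a low-boost fallback -->
    <str name="qf">name_exact^10 name_stemmed^0.1</str>
  </lst>
</requestHandler>
```

With a layout like this, a match on the unstemmed field should generally outrank a match that only exists after stemming.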

I cannot recommend Doug Turnbull and John Berryman's "Relevant Search" enough on how to layer
fields, among many other great insights.

 -----Original Message-----
From: Allison, Timothy B. [] 
Sent: Thursday, November 30, 2017 9:20 AM
Subject: RE: Solr Wildcard Search

At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty much only searching
on nouns? -- you might consider EnglishMinimalStemFilterFactory.
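
A minimal sketch of what that swap might look like, based on the index analyzer posted earlier in this thread. The type name text_en_minimal is a made-up placeholder, and the KeywordMarkerFilterFactory step is left out for brevity:

```xml
<!-- Hypothetical variant of text_en; the type name is an assumption -->
<fieldType name="text_en_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- light plural stemming in place of solr.PorterStemFilterFactory -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index-time analysis like this requires reindexing before the new terms show up.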

I wouldn't say that Porter was or wasn't chosen intentionally.  It may be good for some use
cases.  However, for the use cases I've seen, it has been disastrous.

I have code that shows "equivalence sets" for analysis chain A vs. analysis chain B (with
some noise, and assuming the same tokenization).  I should probably share that code on GitHub
or fold it into Luke somehow.  You can see this on a one-off basis in the Solr admin window via
the Analysis tab, but seeing it across all the terms in your corpus/corpora can be eye-opening,
and cross-checking it against query logs is quite powerful.

On one corpus, when I compared analysis chain A (without Porter) against the otherwise
identical chain B (with Porter), the output looked like this, e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public	public 9834 | publication 1429 | publications 960 | publicly 662 | public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | publiced 1

effect	effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | effects 831 | effected 139 | effecting 85 | effectives 1

new	new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order	order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every document that
mentions "public" (~10k).  That's a huge problem in many circumstances.  Good luck finding
the name "newing".

-----Original Message-----
From: Georgy Nevsky []
Sent: Thursday, November 30, 2017 8:31 AM
Subject: RE: Solr Wildcard Search

I understand that stemming is the reason. Thank you.

What do you suggest using for stemming instead of Porter? I guess it wasn't chosen intentionally.

In the best we trust
Georgy Nevsky

-----Original Message-----
From: Allison, Timothy B. []
Sent: Thursday, November 30, 2017 8:25 AM
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that different query
parsers handle the analysis chain differently for multiterm queries.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" option on the query screen
to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified by each step
in the analysis chain.

If you use Solr admin and debug the query for "shipping", you see that it is stemmed to "ship",
which is why all of your "ship*" matches work.  Porter doesn't have rules for words ending in
"pp", so it doesn't stem "shipp" to "ship".  So your wildcard query is looking for words that
start with "shipp", and given that "shipping" was stemmed to "ship" in the index, it won't find
it.  It would find "shippqrs" because Porter wouldn't know what to do with that 😊
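
One possible workaround, sketched under assumptions rather than taken from anyone's working config, is to keep an unstemmed copy of the field and point wildcard queries at it. The names text_en_unstemmed and name_unstemmed are hypothetical; name_en-us is the field from the original post:

```xml
<!-- Hypothetical unstemmed copy of name_en-us for wildcard queries; new names are assumptions -->
<fieldType name="text_en_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_unstemmed" type="text_en_unstemmed" indexed="true" stored="false" multiValued="false"/>
<copyField source="name_en-us" dest="name_unstemmed"/>
<!-- after reindexing, a query such as name_unstemmed:shipp* can match the indexed term "shipping" -->
```

Because no stemmer runs on this field, "shipping" stays in the index as "shipping", so a prefix of "shipp" has something to match.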

Again, Porter can be very dangerous if it doesn't align with user expectations.

-----Original Message-----
From: Atita Arora []
Sent: Thursday, November 30, 2017 8:16 AM
Subject: Re: Solr Wildcard Search

As Rick raised, the most important aspect here is that the phrase is broken into multiple terms
ORed together. I believe that if the use case requires wildcard search on phrases, we would
need to store the entire phrase as a single term in the index, which is probably not happening
right now, and hence the phrases are not found when sent as phrase queries.
I tried this on my local Solr 7.1: without a phrase it works as expected, but as soon as I do
a phrase search it fails for the reason I mentioned above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <> wrote:

> I wish to understand whether I can do something to get the term
> "shipping" in the results when searching for "shipp*".
> Here is the field definition:
> <field name="name_en-us" type="text_en" indexed="true" stored="true"
> multiValued="false"/>
> <fieldType name="text_en" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="lang/stopwords_en.txt"
>             />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
> Could anything else be important? Most configuration parameters are
> the defaults for Apache Solr 7.1.0.
> In the best we trust
> Georgy Nevsky
> -----Original Message-----
> From: Rick Leir []
> Sent: Thursday, November 30, 2017 7:32 AM
> To:
> Subject: Re: Solr Wildcard Search
> George,
> When you get those results it could be due to stemming.
> Wildcard processing expands your term to multiple terms, OR'd 
> together. It also takes you down a different analysis pathway, as many 
> analysis components do not work with multiple terms. Look into the 
> SolrAdmin console, and use the analysis tab to understand what is 
> going on.
> If you still have doubts, tell us more about your config.
> Cheers --Rick
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
> <> wrote:
> >Can somebody help me understand how Solr wildcard search works?
> >
> >If I search for the term “ship*”, I get many strings in the results,
> >such as “Shipping Weight”, “Ship From”, “Shipping Calculator”, etc.
> >
> >But if I search for “shipp*”, I don’t get any results.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
