lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <>
Subject [jira] [Commented] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware
Date Sun, 03 Jun 2012 14:29:23 GMT


Jack Krupansky commented on SOLR-3503:

It could be tricky, but it could work, provided users are made aware of how wildcards
can interfere or interact with stemming. Testing is essential, as is good user
documentation of how to navigate the stemming vs. wildcards minefield.

Unless the user actually knows what the stemmed term will be, even simple trailing wildcards
can be tricky, since the stem could be much shorter than the user expects. For example,
"investment*" would not match if the actual stemmed and indexed term is "invest" for a
particular stemmer.
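
To make the trailing-wildcard case concrete, here is a minimal sketch. It assumes a toy suffix-stripping function (toy_stem, hypothetical, not the Snowball algorithm) and uses Python's fnmatch as a stand-in for wildcard matching against the indexed terms; it is not how Solr evaluates wildcard queries internally.

```python
from fnmatch import fnmatchcase

def toy_stem(term):
    """Hypothetical stemmer: strips a few suffixes. NOT the Snowball algorithm."""
    for suffix in ("ments", "ment", "ing", "s"):
        # Keep at least three characters of stem.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# Index time: tokens are stemmed before they are stored.
indexed_terms = {toy_stem(t) for t in
                 ["investment", "investments", "investing", "investor"]}

def wildcard_matches(pattern, terms):
    """Match an unanalyzed wildcard pattern against the stemmed terms."""
    return sorted(t for t in terms if fnmatchcase(t, pattern))

print(sorted(indexed_terms))                            # stems actually in the index
print(wildcard_matches("investment*", indexed_terms))   # pattern is longer than the stem
print(wildcard_matches("invest*", indexed_terms))       # pattern no longer than the stem
```

With this toy stemmer the index holds only "invest" and "investor", so "investment*" matches nothing while "invest*" matches both.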

Leading wildcards can sometimes be okay, but that depends completely on the particular stemmer.
For example, "*ment" only works if the stemmer leaves the "ment" suffix in the indexed term.

And simple embedded wildcards can be a real wild card (unpredictable), once again depending
on the specific stemmer. For example, "inve*ment".
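
The stemmer dependence of leading and embedded wildcards can be sketched the same way. The two stemmers below (strip_ment and strip_plural) are hypothetical stand-ins chosen to disagree about the "ment" suffix; neither is a real Snowball stemmer, and fnmatch again only approximates wildcard matching over indexed terms.

```python
from fnmatch import fnmatchcase

def strip_ment(term):
    """Hypothetical stemmer A: removes a trailing "ment" or "ments"."""
    for suffix in ("ments", "ment"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def strip_plural(term):
    """Hypothetical stemmer B: removes only a trailing plural "s"."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

tokens = ["investment", "investments", "payment", "movement"]

def matches(pattern, stemmer):
    """Stem the tokens at index time, then match the raw pattern against them."""
    index = {stemmer(t) for t in tokens}
    return sorted(t for t in index if fnmatchcase(t, pattern))

for pattern in ("*ment", "inve*ment"):
    print(pattern, matches(pattern, strip_ment), matches(pattern, strip_plural))
```

Under stemmer A neither pattern matches anything, because "ment" never reaches the index; under stemmer B both patterns behave the way a user would expect.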

But, I don't think any or all of those concerns are any worse than the situation we have today.

But, some robust tests would be needed to persuade me that this improvement is actually okay.

Right now, I say go for it, including the test examples for various stemmers and documentation
for issues that users must be aware of (call it "safe wildcards in the presence of stemming.")
I think the only restriction is that query results should not be worse than without this improvement.
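
One way such a "not worse" test could look is sketched below. Everything here is an assumption for illustration: toy_stem is a hypothetical stemmer, stem_trailing_wildcard is one guess at how a stemmer might be applied to a wildcard term (stemming the literal prefix of a simple "prefix*" pattern and nothing else), and "not worse" is read narrowly as "every document matched before is still matched after". It says nothing about precision.

```python
from fnmatch import fnmatchcase

def toy_stem(term):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ments", "ment", "ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# One token per document keeps the sketch small; terms are stemmed at index time.
docs = {1: "investment", 2: "investing", 3: "investor", 4: "inventory"}
index = {doc_id: toy_stem(token) for doc_id, token in docs.items()}

def search(pattern):
    """Return the ids of documents whose indexed term matches the pattern."""
    return {doc_id for doc_id, term in index.items() if fnmatchcase(term, pattern)}

def stem_trailing_wildcard(pattern):
    """Assumed rewrite: stem only the literal prefix of a simple 'prefix*' pattern."""
    if pattern.endswith("*") and "*" not in pattern[:-1] and "?" not in pattern:
        return toy_stem(pattern[:-1]) + "*"
    return pattern  # leading and embedded wildcards are left untouched

def not_worse(pattern):
    """True if the rewritten query loses no document the raw query matched."""
    return search(pattern) <= search(stem_trailing_wildcard(pattern))

for p in ("investment*", "investing*", "inv*", "*ment", "inve*ment"):
    print(p, sorted(search(p)), sorted(search(stem_trailing_wildcard(p))), not_worse(p))
```

A real test suite would need the same kind of check run against each actual stemmer, since the rewrite that is safe for one may not be safe for another.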

Unfortunately, the doc may be stemmer-dependent, and separate tests would be needed for each stemmer.

The bottom line is to reduce the surprise factor for the user.

As a side note, it would be nice if Solr had a mechanism to return "informative notes and
warnings" with a query response. For example, "Wildcard term inves*ment matches no indexed

> Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware
> ---------------------------------------------------------------------
>                 Key: SOLR-3503
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 4.0, 5.0
> It seems to me that all the stemmers could be MultiTermAware, anyone know of a reason
