lucene-solr-user mailing list archives

From Michael Gibney <mich...@michaelgibney.net>
Subject Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"
Date Thu, 05 Dec 2019 18:22:57 GMT
I wonder if this might be related to the underlying problem that
https://issues.apache.org/jira/browse/LUCENE-8985 is intended to
address?

Btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). That's probably *not* what's causing the problem,
though, since the extra FGF (the one right after WDGF) doesn't seem
to modify any attributes.
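
For reference, a minimal, untested sketch of what a single-FGF index
analyzer could look like. It assumes the rest of your fieldType stays
exactly as posted; only the first FlattenGraphFilterFactory is removed,
and the filter names and attributes are copied from your config:

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Single FlattenGraphFilter, placed after every graph-emitting filter -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```

I would not expect this alone to bring back "can" and "cans", so it is
worth re-running "can't" through the Analysis screen after the change.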



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <ericbuss@abebooks.com> wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
> tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches that
> contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> <field name="myField" type="text_general" indexed="true" stored="true"/>
>
> And the relevant fieldType "text_general":
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Solr console Analysis and "can't" as the Field Value, the following
> tokens are produced (find the verbose output at the bottom of this email):
>
> Index
> ST    | can't
> SF    | can't
> WDGF  | cant | can't | can | t
> FGF   | cant | can't | can | t
> SGF   | cants | cant | can't | | cans | can | t
> FGF   | cants | cant | can't | | t
> ICUFF | cants | cant | can't | | t
>
> Query
> ST    | can't
> SF    | can't
> WDGF  | can | t
> SF    | can | t
> ICUFF | can | t
>
> As you can see, after the second FGF the tokens "can" and "cans" are pruned,
> so the query does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation that states "If you use
> [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
> Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning I tried removing the FGF on a local
> cluster. It still runs and this search now works, but I am
> paranoid that this will break far more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well; others do
> not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
> preserve their tokens; "won't" does not. I believe the pattern here is that
> whenever part of the contraction has synonyms this problem manifests.
>
> Eliminating WDGF is not viable as we rely on this functionality for other uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable: when we have the data
> "historical-text", we want it to match the search "history text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory to
> replace "can't" with "cant". Though this technically solves the issue, I hope it
> is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> SF    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> WDGF  | text          | cant          | can't            | can        | t          |
>       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start         | 0             | 0                | 0          | 4          |
>       | end           | 5             | 5                | 3          | 5          |
>       | positionLength| 2             | 2                | 1          | 1          |
>       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1             | 1                | 1          | 1          |
>       | position      | 1             | 1                | 1          | 2          |
>       | keyword       | false         | false            | false      | false      |
> FGF   | text          | cant          | can't            | can        | t          |
>       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start         | 0             | 0                | 0          | 4          |
>       | end           | 5             | 5                | 3          | 5          |
>       | positionLength| 2             | 2                | 1          | 1          |
>       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1             | 1                | 1          | 1          |
>       | position      | 1             | 1                | 1          | 2          |
>       | keyword       | false         | false            | false      | false      |
> SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
>       | start         | 0                | 0             | 0                | 0             | 0          | 4          |
>       | end           | 5                | 5             | 5                | 3             | 3          | 5          |
>       | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
>       | position      | 1                | 1             | 1                | 3             | 3          | 4          |
>       | keyword       | false            | false         | false            | false         | false      | false      |
> FGF   | text          | cants            | cant          | can't            | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start         | 0                | 0             | 0                | 4          |
>       | end           | 5                | 5             | 5                | 5          |
>       | positionLength| 1                | 1             | 1                | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1          |
>       | position      | 1                | 1             | 1                | 3          |
>       | keyword       | false            | false         | false            | false      |
> ICUFF | text          | cants            | cant          | can't            | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start         | 0                | 0             | 0                | 4          |
>       | end           | 5                | 5             | 5                | 5          |
>       | positionLength| 1                | 1             | 1                | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1          |
>       | position      | 1                | 1             | 1                | 3          |
>       | keyword       | false            | false         | false            | false      |
>
> Query
>
> ST    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> SF    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> WDGF  | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
> SF    | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
> ICUFF | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
>
