lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
Date Tue, 16 Jul 2013 11:56:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709695#comment-13709695 ]

Michael McCandless commented on LUCENE-5030:
--------------------------------------------

Sorry for the long delay here ...

Just to verify: there is no point to passing FUZZY_UNICODE_AWARE to AnalyzingSuggester, right?

In which case, I think the AnalyzingLookupFactory should not be changed?

But, furthermore, I think we can isolate the changes to FuzzySuggester?  E.g., move the
FUZZY_UNICODE_AWARE flag down to FuzzySuggester, fix its ctor to strip that option when
calling super(), move isFuzzyUnicodeAware down as well, and then override toLookupAutomaton
to do the UTF-8 conversion + determinization?

This way it's not even possible to send the fuzzy flag to AnalyzingSuggester.
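
The shape of that refactoring can be sketched roughly as below. This is only an illustration with hypothetical stand-in classes (BaseSuggesterSketch, FuzzySuggesterSketch), not the Lucene classes or the attached patch: the flag value 4, the base ctor rejecting unknown option bits, and the String-returning toLookupAutomaton are all assumptions made so the example is self-contained.

```java
// Hypothetical stand-in for the base suggester; NOT Lucene's AnalyzingSuggester.
class BaseSuggesterSketch {
    static final int EXACT_FIRST = 1;
    static final int PRESERVE_SEP = 2;

    private final int options;

    BaseSuggesterSketch(int options) {
        // Assumption of this sketch: the base class rejects bits it does not know.
        if ((options & ~(EXACT_FIRST | PRESERVE_SEP)) != 0) {
            throw new IllegalArgumentException("unknown option bits: " + options);
        }
        this.options = options;
    }

    int options() {
        return options;
    }

    // Placeholder: a real implementation would build an automaton, not a String.
    String toLookupAutomaton(String key) {
        return "utf8-automaton(" + key + ")";
    }
}

// Hypothetical stand-in for the fuzzy subclass that owns the flag.
class FuzzySuggesterSketch extends BaseSuggesterSketch {
    // Assumed value; it only needs to be a bit the base class does not use.
    static final int FUZZY_UNICODE_AWARE = 4;

    private final boolean unicodeAware;

    FuzzySuggesterSketch(int options) {
        // Strip the fuzzy-only bit so the base class never sees it.
        super(options & ~FUZZY_UNICODE_AWARE);
        this.unicodeAware = (options & FUZZY_UNICODE_AWARE) != 0;
    }

    boolean isFuzzyUnicodeAware() {
        return unicodeAware;
    }

    @Override
    String toLookupAutomaton(String key) {
        if (!unicodeAware) {
            return super.toLookupAutomaton(key);
        }
        // Placeholder for: build in code point space, convert to UTF-8, determinize.
        return "determinize(utf8(codepoint-automaton(" + key + ")))";
    }
}
```

With this layout the base class has no knowledge of the fuzzy flag, so passing it directly to the base ctor fails fast instead of being silently ignored.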
                
> FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5030
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.3
>            Reporter: Artem Lukanin
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.4
>
>         Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, benchmark-wo_convertion.txt,
>                      LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch, LUCENE-5030.patch,
>                      nonlatin_fuzzySuggester1.patch, nonlatin_fuzzySuggester2.patch,
>                      nonlatin_fuzzySuggester3.patch, nonlatin_fuzzySuggester4.patch,
>                      nonlatin_fuzzySuggester_combo1.patch, nonlatin_fuzzySuggester_combo2.patch,
>                      nonlatin_fuzzySuggester_combo.patch, nonlatin_fuzzySuggester.patch,
>                      nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch,
>                      run-suggest-benchmark.patch
>
>
> There is a limitation in the current FuzzySuggester implementation: it computes edits
> in UTF-8 space instead of Unicode character (code point) space.
> This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character
> space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion
> in Unicode character space, then convert that automaton to UTF-8, then intersect with the
> suggest FST.
> See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none
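
To make the UTF-8 versus code point distinction in the description concrete, here is a small standalone sketch. It does not use Lucene at all: the class EditSpaceDemo and its helpers are hypothetical, and a plain dynamic-programming Levenshtein distance stands in for the automaton-based expansion. It shows that replacing one Cyrillic letter costs two edits when counted over UTF-8 bytes but one edit over code points, while an ASCII substitution costs one edit either way.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
class EditSpaceDemo {

    // Classic two-row dynamic-programming Levenshtein distance over int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // The string as a sequence of unsigned UTF-8 byte values.
    static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    // The string as a sequence of Unicode code points.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // Two Russian words differing in the first letter (U+043A vs U+0442).
        String kot = "\u043a\u043e\u0442";
        String tot = "\u0442\u043e\u0442";
        System.out.println("cat/hat code points: " + levenshtein(codePoints("cat"), codePoints("hat")));
        System.out.println("cat/hat UTF-8 bytes: " + levenshtein(utf8Units("cat"), utf8Units("hat")));
        System.out.println("Cyrillic code points: " + levenshtein(codePoints(kot), codePoints(tot)));
        System.out.println("Cyrillic UTF-8 bytes: " + levenshtein(utf8Units(kot), utf8Units(tot)));
    }
}
```

U+043A encodes as the bytes D0 BA and U+0442 as D1 82, so both bytes of the letter change. A fuzzy match that budgets edits in byte space therefore spends two edits on what a user sees as a single typo.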
