lucene-dev mailing list archives

From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7639) Use Suffix Arrays for fast search with leading asterisks
Date Sat, 04 Feb 2017 21:14:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852937#comment-15852937 ]

Dawid Weiss commented on LUCENE-7639:
-------------------------------------

On vacation, so I'll keep this short. Having thought about it for longer, I think my reasoning
here was wrong -- a reversed FST would work for leading (prefix) wildcards, but for infixes you'd
still need a full suffix tree (or an automaton built over all suffixes, each leading to a term ID):

bq. A suffix array is functionally equivalent to a suffix tree, which you could build and
encode as an FST. Then any infix matching would be done similarly to suffix array-based lookups.
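
To make the distinction concrete, here is a minimal sketch of the two lookups. It is not
Lucene code and not the attached patch: the class WildcardIndexSketch and its methods are
hypothetical, a sorted map of reversed terms stands in for the reversed FST, and the suffix
array is built naively by sorting materialized suffix strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from Lucene or the attached patch.
final class WildcardIndexSketch {
    private final String[] terms;
    // Reversed term -> term ID; a stand-in for the reversed FST (assumes distinct terms).
    private final TreeMap<String, Integer> reversedTerms = new TreeMap<>();
    // Sorted {termId, offset} pairs, one per suffix of every term.
    private final int[][] suffixes;

    WildcardIndexSketch(String... terms) {
        this.terms = terms.clone();
        List<int[]> all = new ArrayList<>();
        for (int id = 0; id < this.terms.length; id++) {
            String term = this.terms[id];
            reversedTerms.put(new StringBuilder(term).reverse().toString(), id);
            for (int off = 0; off < term.length(); off++) {
                all.add(new int[] {id, off});
            }
        }
        // Naive construction for clarity; real suffix array builders avoid
        // materializing every suffix string during the sort.
        all.sort((a, b) -> suffix(a).compareTo(suffix(b)));
        this.suffixes = all.toArray(new int[0][]);
    }

    private String suffix(int[] entry) {
        return terms[entry[0]].substring(entry[1]);
    }

    /** Leading wildcard (*tail): a prefix scan over the reversed terms is enough. */
    SortedSet<Integer> endingWith(String tail) {
        String prefix = new StringBuilder(tail).reverse().toString();
        SortedSet<Integer> ids = new TreeSet<>();
        for (Map.Entry<String, Integer> e : reversedTerms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the contiguous range of matching keys
            }
            ids.add(e.getValue());
        }
        return ids;
    }

    /** Infix (*infix*): binary search over all suffixes, then collect term IDs. */
    SortedSet<Integer> containing(String infix) {
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) { // lower bound: first suffix >= infix
            int mid = (lo + hi) >>> 1;
            if (suffix(suffixes[mid]).compareTo(infix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        SortedSet<Integer> ids = new TreeSet<>();
        for (int i = lo; i < suffixes.length && suffix(suffixes[i]).startsWith(infix); i++) {
            ids.add(suffixes[i][0]);
        }
        return ids;
    }
}
```

The reversed-term structure answers only "ends with" questions, because a match has to begin
at the end of a term. The suffix array holds one entry per suffix of every term, so a match
may begin at any offset, and that is also where the extra memory goes.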


> Use Suffix Arrays for fast search with leading asterisks
> --------------------------------------------------------
>
>                 Key: LUCENE-7639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Yakov Sirotkin
>         Attachments: suffix-array.patch
>
>
> If a query term starts with an asterisk, the FST checks every word in the dictionary, so
> request processing slows down. This problem can be solved with a suffix array. Fortunately,
> the suffix array can be built from the existing index after Lucene starts. Unfortunately,
> suffix arrays require a lot of RAM, so they are used only when a special flag is set:
> -Dsolr.suffixArray.enable=true
> Suffix array initialization can be sped up by using several threads; the number of threads
> is controlled with
> -Dsolr.suffixArray.initialization_treads_count=5
> This system property can be omitted; the default value is 5.
> The attached patch is the suggested implementation of suffix array support. It works for
> all terms that start with an asterisk and contain at least 3 consecutive non-wildcard
> characters. The patch does not change search results; it affects only performance.
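
For reference, the two system properties in the quoted description could be read roughly as
follows. This is only a sketch under assumptions: the class SuffixArraySettings is
hypothetical, and the attached patch may read the properties differently. The property names
are copied verbatim from the description, and the default of 5 is the one stated there.

```java
// Hypothetical helper; not taken from the attached patch.
final class SuffixArraySettings {
    // Property names copied verbatim from the issue description.
    static final String ENABLE_PROPERTY = "solr.suffixArray.enable";
    static final String THREADS_PROPERTY = "solr.suffixArray.initialization_treads_count";
    static final int DEFAULT_THREADS = 5; // default stated in the description

    private SuffixArraySettings() {}

    /** True only when -Dsolr.suffixArray.enable=true was passed (case-insensitive). */
    static boolean enabled() {
        return Boolean.getBoolean(ENABLE_PROPERTY);
    }

    /** Thread count for initialization; falls back to 5 if unset or not an integer. */
    static int initializationThreads() {
        return Integer.getInteger(THREADS_PROPERTY, DEFAULT_THREADS);
    }
}
```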



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org