lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7914) Add safeguards to RegExp.toAutomaton and Operations
Date Tue, 01 Aug 2017 13:35:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108892#comment-16108892
] 

Robert Muir commented on LUCENE-7914:
-------------------------------------

ok, thanks.

As far as finite stuff, we might be able to remove the requirement that this is provided to
AutomatonTermsEnum at all (see https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L200-L202)

It is kind of a leftover from when its intersection was less efficient (it handled infinite
and finite DFAs differently). Nowadays it handles the case where its actually in a looping
portion and drives that part completely by the terms dictionary. 

I'm not sure if the current "if finite" guards to this loop detection really save us CPU for
complex finite DFAs (e.g. fuzzy, spellcheck). That would be the thing to experiment with to
see if this boolean could be removed...

> Add safeguards to RegExp.toAutomaton and Operations
> ---------------------------------------------------
>
>                 Key: LUCENE-7914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7914
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7914.patch, LUCENE-7914.patch, LUCENE-7914.patch, LUCENE-7914.patch
>
>
> When creating an automaton from a regexp, some operators can create more states than
maxDeterminizedStates:
> {code}
> a{10000}
> {code}
> The example above <b>creates</b> a single path with 10k states already determinized
so maxDeterminizedStates is not checked. 
> Some operations on automaton like Operations.isFinite or Operations.topoSortStates are
recursive and the maximum level of recursion depends on the longest path in the automaton.
So a large automaton like above can exceed java's stack.
> In most of the cases we are covered by maxDeterminizedStates but there will always be
adversarial cases where a large automaton is created from a small input so I think we should
also have safeguards in the recursive methods. 
> I've attached a patch that adds a max recursion level to Operations.isFinite and Operations.topoSortStates
in order to limit stack overflows. The limit is set to 1000 so any automaton with a path bigger
than 1000 would throw an IllegalStateException.
> The patch also uses maxDeterminizedStates to limit the number of states that a repeat
operator can create and throw a TooComplex..Exception when this limit is reached.
> Finally the patch adds the ability to skip Operations.isFinite on AutomatonQuery and
uses this as an optimization for PrefixQuery that uses infinite automatons only.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message