lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
Date Mon, 03 Nov 2014 09:51:34 GMT


Michael McCandless commented on LUCENE-6046:

Hmm, two bugs here.

First off, RegExp.toAutomaton is an inherently costly method: it is wasteful of RAM and CPU
because it minimizes after each recursive operation in order to build a DFA in the end.
Unfortunately, it's quite easy to concoct regular expressions that make it consume ridiculous
resources.  I'll look at this example and see if we can improve it, but in the end it will
always have its "adversarial cases" unless we give up on making the resulting automaton
deterministic, which would be a very big change.

Maybe we should add adversary defenses to it, e.g. you set a limit on the number of states
it's allowed to create, and it throws a RegExpTooHardException if it would exceed that?
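
A minimal sketch of what such a limit could look like, kept independent of Lucene so it runs
on its own. Everything here is hypothetical: RegExpTooHardException is only the name floated
above, and StateLimitSketch, countDfaStates and maxStates are invented for illustration. It
determinizes the textbook adversarial pattern (a|b)*a(a|b){n}, whose DFA needs 2^(n+1)
states, and bails out as soon as the state budget is exceeded.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical exception; the name comes from the suggestion above, not from an existing class. */
class RegExpTooHardException extends RuntimeException {
    RegExpTooHardException(int maxStates) {
        super("determinization would create more than " + maxStates + " states");
    }
}

/** Hypothetical illustration of a state budget during subset construction. */
class StateLimitSketch {

    /**
     * Determinizes the NFA for (a|b)*a(a|b){n} and returns the number of DFA states,
     * or throws RegExpTooHardException once more than maxStates states exist.
     * NFA states 0..n+1 are encoded as bits of a long, so n must be at most 61.
     */
    static int countDfaStates(int n, int maxStates) {
        if (n < 0 || n > 61) {
            throw new IllegalArgumentException("n must be in [0, 61]");
        }
        Set<Long> seen = new HashSet<>();
        ArrayDeque<Long> queue = new ArrayDeque<>();
        long start = 1L; // the subset {0}
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            long subset = queue.poll();
            for (int symbol = 0; symbol < 2; symbol++) { // 0 = 'a', 1 = 'b'
                long next = step(subset, symbol, n);
                if (seen.add(next)) {
                    // The defense: stop early instead of exhausting the heap.
                    if (seen.size() > maxStates) {
                        throw new RegExpTooHardException(maxStates);
                    }
                    queue.add(next);
                }
            }
        }
        return seen.size();
    }

    /** Computes the set of NFA states reachable from subset on one symbol. */
    static long step(long subset, int symbol, int n) {
        long next = 0L;
        if ((subset & 1L) != 0) {
            next |= 1L;          // state 0 loops on both symbols
            if (symbol == 0) {
                next |= 2L;      // state 0 moves to state 1 on 'a'
            }
        }
        // States 1..n advance to the next state on either symbol;
        // the accepting state n+1 has no outgoing transitions.
        long middle = subset & (((1L << (n + 1)) - 1) & ~1L);
        next |= middle << 1;
        return next;
    }
}
```

With a generous budget, countDfaStates(3, 1000) returns 16; with n = 10 and a budget of 1000
it throws, because the full DFA would need 2048 states.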

Second off, ArrayUtil.oversize has the wrong (too large) value for MAX_ARRAY_LENGTH, which
is a bug from LUCENE-5844.  Which JVM did you run this test on?
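
For reference, the arithmetic on the two numbers in the exception quoted below can be checked
with a few lines of Java. The class and method names are made up for illustration. Reading
the resulting gap as a JVM-dependent array header size is only an assumption, not something
the trace itself confirms.

```java
/** Hypothetical helper; the two constants are copied from the quoted exception message. */
class ArrayLimitArithmetic {
    static final long REPORTED_MAX = 2147483623L; // "maximum array in java"
    static final long REQUESTED = 2147483624L;    // "requested array size"

    /** How far the reported maximum sits below Integer.MAX_VALUE. */
    static long gapBelowIntMax() {
        return (long) Integer.MAX_VALUE - REPORTED_MAX;
    }

    /** How far the request overshot the reported maximum. */
    static long overshoot() {
        return REQUESTED - REPORTED_MAX;
    }

    public static void main(String[] args) {
        System.out.println(gapBelowIntMax()); // prints 24
        System.out.println(overshoot());      // prints 1
    }
}
```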

> RegExp.toAutomaton high memory use
> ----------------------------------
>                 Key: LUCENE-6046
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 4.10.1
>            Reporter: Lee Hinman
>            Priority: Minor
> When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible
> for the automaton to use so much memory that it exceeds the maximum array size for Java.
> The following caused an OutOfMemoryError with a 32gb heap:
> {noformat}
> new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
> {noformat}
> When increased to a 60gb heap, the following exception is thrown:
> {noformat}
>   1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
>   1>     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
>   1>     org.apache.lucene.util.ArrayUtil.oversize(
>   1>     org.apache.lucene.util.ArrayUtil.grow(
>   1>     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(
>   1>     org.apache.lucene.util.automaton.Operations.determinize(
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimize(
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(
> {noformat}
