lucene-dev mailing list archives

From "Nik Everett (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use
Date Mon, 03 Nov 2014 16:36:34 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6046:
--------------------------------
    Attachment: LUCENE-6046.patch

First cut at a patch.  Adds maxDeterminizedStates to Operations.determinize and pipes it through
to tons of places.  I think it's important never to hide when determinize is called because
of how potentially heavy it is.  Forcing callers of MinimizationOperations.minimize, Operations.reverse,
Operations.minus, etc. to specify maxDeterminizedStates makes it pretty clear that the automaton
might be determinized during those processes.

I added an unchecked exception for when the Automaton can't be determinized within the specified
number of states, but I'm really tempted to change it to a checked exception to make it super
duper obvious when determinization might occur.
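
To illustrate the idea of a state budget (this is not the code in the patch), here is a
minimal, self-contained sketch of subset construction that gives up once it would need more
than maxDeterminizedStates DFA states.  Everything in it is made up for the example: the class
BoundedDeterminizeSketch, the exception TooManyStatesException, and the int[][][] NFA encoding
are assumptions and are not Lucene's API.  The test NFA is the textbook "the symbol n+1
positions from the end is 'a'" automaton, whose DFA needs 2^(n+1) states; the bounded repetition
in the reported regexp can blow up in a similar way.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy illustration only; none of these names are Lucene's. */
class BoundedDeterminizeSketch {

    /** Hypothetical unchecked exception, thrown when the state budget is exceeded. */
    static class TooManyStatesException extends RuntimeException {
        TooManyStatesException(int maxStates) {
            super("determinizing would need more than " + maxStates + " states");
        }
    }

    /**
     * Subset construction over an NFA encoded as nfa[state][symbol] = target states,
     * with state 0 as the start state. Returns the number of reachable DFA states,
     * or throws TooManyStatesException once that number would exceed the budget.
     */
    static int determinize(int[][][] nfa, int alphabetSize, int maxDeterminizedStates) {
        BitSet start = new BitSet();
        start.set(0);
        Map<BitSet, Integer> seen = new HashMap<>();
        Deque<BitSet> work = new ArrayDeque<>();
        seen.put(start, 0);
        work.add(start);
        while (!work.isEmpty()) {
            BitSet current = work.remove();
            for (int symbol = 0; symbol < alphabetSize; symbol++) {
                // Union of the targets of every NFA state in the current subset.
                BitSet next = new BitSet();
                for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
                    for (int target : nfa[s][symbol]) {
                        next.set(target);
                    }
                }
                if (next.isEmpty() || seen.containsKey(next)) {
                    continue;
                }
                // Check the budget before allocating another DFA state.
                if (seen.size() >= maxDeterminizedStates) {
                    throw new TooManyStatesException(maxDeterminizedStates);
                }
                seen.put(next, seen.size());
                work.add(next);
            }
        }
        return seen.size();
    }

    /** NFA for (a|b)*a(a|b){n} over symbols 0 = 'a', 1 = 'b'; its DFA has 2^(n+1) states. */
    static int[][][] nthFromLastNfa(int n) {
        int[][][] nfa = new int[n + 2][2][];
        nfa[0][0] = new int[] {0, 1}; // on 'a': stay, or guess this is the marked 'a'
        nfa[0][1] = new int[] {0};    // on 'b': stay
        for (int i = 1; i <= n; i++) {
            nfa[i][0] = new int[] {i + 1};
            nfa[i][1] = new int[] {i + 1};
        }
        nfa[n + 1][0] = new int[0];   // final state has no outgoing transitions
        nfa[n + 1][1] = new int[0];
        return nfa;
    }

    public static void main(String[] args) {
        // Small case fits comfortably inside the budget.
        System.out.println(determinize(nthFromLastNfa(3), 2, 10_000));
        try {
            // 2^21 states would be needed, so this stops early instead of exhausting memory.
            determinize(nthFromLastNfa(20), 2, 10_000);
        } catch (TooManyStatesException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

Because the budget is an explicit argument, a caller cannot reach determinize without deciding
how much work it is willing to pay for.  Making the exception checked instead of unchecked would
push that same decision into every caller's signature.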

> RegExp.toAutomaton high memory use
> ----------------------------------
>
>                 Key: LUCENE-6046
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6046
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 4.10.1
>            Reporter: Lee Hinman
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-6046.patch
>
>
> When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory it exceeds the maximum array size for java.
> The following caused an OutOfMemoryError with a 32gb heap:
> {noformat}
> new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
> {noformat}
> When increased to a 60gb heap, the following exception is thrown:
> {noformat}
>   1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
>   1>     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
>   1>     org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
>   1>     org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
>   1>     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
>   1>     org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
>   1>     org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
>   1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

