lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6079) PatternReplaceCharFilter crashes JVM with OutOfMemoryError
Date Thu, 27 Nov 2014 13:48:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227671#comment-14227671 ]

Jack Krupansky commented on LUCENE-6079:
----------------------------------------

But the pattern might in fact need the entire input, for example to match the end of the
input with "$".
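
To illustrate the point, here is a minimal sketch (plain java.util.regex, not the Lucene
filter; the class and the replacePerChunk helper are hypothetical names) of what goes wrong
when a "$" pattern is applied to each chunk independently: the end of every chunk looks
like the end of the input.

```java
import java.util.regex.Pattern;

final class DollarAnchorSketch {

    // Hypothetical helper: splits the input into fixed-size pieces and runs the
    // replacement on each piece independently, with no knowledge of what follows.
    static String replacePerChunk(String input, Pattern pattern, String replacement,
                                  int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i += chunkSize) {
            String chunk = input.substring(i, Math.min(input.length(), i + chunkSize));
            out.append(pattern.matcher(chunk).replaceAll(replacement));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("foo$");
        String input = "foo bar foo";
        // Whole input: only the trailing "foo" is at the end of the input.
        System.out.println(p.matcher(input).replaceAll("X"));   // foo bar X
        // Chunks of 3 chars: the first chunk is exactly "foo", so "$" matches there,
        // while the real trailing "foo" is split across two chunks and is missed.
        System.out.println(replacePerChunk(input, p, "X", 3));  // X bar foo
    }
}
```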

Still, it would be nice to have an optional "chunked mode" for cases such as this (assuming
that the pattern doesn't end with "$"), for example when the input is the full text of a
multi-MB PDF file. I would suggest that such a mode be the default, with a reasonable chunk
size such as 100K. There should also be an "overlap" size so that matches can be made across
chunk boundaries: matching on the next chunk would start inside the overlap at the end of
the previous chunk, and a match that extends into the overlap area at the end of a chunk
would not be performed unless it is the last chunk.
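
A rough sketch of how such a chunked mode with overlap might look, using plain
java.util.regex rather than the Lucene CharFilter API (no offset correction is attempted).
The class name, method name and parameters are hypothetical. It assumes the pattern uses
no anchors, word boundaries or lookaround, never matches the empty string, and that no
match is longer than the overlap; the replacement is treated as literal text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkedReplaceSketch {

    static String replaceChunked(Reader in, Pattern pattern, String replacement,
                                 int chunkSize, int overlap) throws IOException {
        if (overlap < 0 || chunkSize <= overlap) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        StringBuilder out = new StringBuilder();
        StringBuilder buf = new StringBuilder();  // carried-over tail + newly read chars
        char[] tmp = new char[chunkSize];
        boolean eof = false;
        while (!eof) {
            // Top the buffer up to at most chunkSize chars.
            while (buf.length() < chunkSize) {
                int n = in.read(tmp, 0, chunkSize - buf.length());
                if (n < 0) { eof = true; break; }
                buf.append(tmp, 0, n);
            }
            // Matches may not reach past 'limit' unless this is the last chunk.
            int limit = eof ? buf.length() : buf.length() - overlap;
            int emitted = 0;   // buf index up to which output has been written
            int cut = limit;   // buf index up to which this pass consumes input
            Matcher m = pattern.matcher(buf);
            while (m.find()) {
                // Defer a match that reaches into the overlap area; it is retried
                // with more input. A match starting at 0 is accepted anyway so
                // that every pass makes progress.
                if (!eof && m.end() > limit && m.start() > 0) {
                    cut = Math.min(limit, m.start());
                    break;
                }
                out.append(buf, emitted, m.start()).append(replacement);
                emitted = m.end();
            }
            cut = Math.max(cut, emitted);
            out.append(buf, emitted, cut);  // unmatched text before the cut
            buf.delete(0, cut);             // keep the tail for the next pass
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String input = "aaa foo bbb foo ccc";
        // Tiny sizes so that matches straddle chunk boundaries.
        System.out.println(replaceChunked(new StringReader(input),
                Pattern.compile("foo"), "X", 8, 3));  // aaa X bbb X ccc
    }
}
```

Memory use is bounded by the chunk size rather than by the input size. The price is the
stated limitation: a match longer than the overlap can be missed or cut short at a chunk
boundary.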

Actually, it turns out that there was such a feature, with a "maxBlockChars" parameter, but
it was deprecated long ago, with no mention in CHANGES.TXT. The factory code still accepts
the parameter, with only a "TODO" comment suggesting that a warning would be appropriate,
but the actual Lucene filter constructor simply ignores it.



> PatternReplaceCharFilter crashes JVM with OutOfMemoryError
> ----------------------------------------------------------
>
>                 Key: LUCENE-6079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6079
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>         Environment: Microsoft Windows, x86_64, 32 GB main memory
>            Reporter: Alexander Veit
>            Priority: Critical
>
> PatternReplaceCharFilter fills memory with input data until an OutOfMemoryError is thrown.
> java.lang.OutOfMemoryError: Java heap space
> 	at java.util.Arrays.copyOf(Arrays.java:3332)
> 	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> 	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> 	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
> 	at java.lang.StringBuilder.append(StringBuilder.java:190)
> 	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.fill(PatternReplaceCharFilter.java:84)
> 	at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:74)
>     ...
> PatternReplaceCharFilter should read data chunk-wise and pass the transformed output
> chunk-wise to the caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

