tika-dev mailing list archives

From "Niall Pemberton" <niall.pember...@gmail.com>
Subject Re: Problem with WordParser
Date Thu, 20 Dec 2007 18:06:59 GMT
On Dec 19, 2007 7:43 PM, Mats Norén <mats.noren@gmail.com> wrote:
> Hello,
> I've been trying to extract text from a couple of different MS-Word
> files and I'm getting mixed results.
> Almost by random (as I see it) I get this error:
> java.lang.StringIndexOutOfBoundsException: String index out of range: -21047
>         at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:886)
>         at java.lang.StringBuffer.substring(StringBuffer.java:417)
>         at org.apache.poi.hwpf.model.TextPiece.substring(TextPiece.java:88)
>         at org.apache.tika.parser.microsoft.WordParser.extractText(WordParser.java:163)
>
> Looking at the TextPiece in POI I can see that the substring method is
> called with a negative value for end
>
> public String substring(int start, int end)
>    {
>      int denominator = _usesUnicode ? 2 : 1;
>
>      return ((StringBuffer)_buf).substring(start/denominator, end/denominator);
>    }
>
> I just can't see why / how runEnd - currentTextStart can end up being
> a negative value.

From my reading of the code I can't see how it can be anything other
than zero or negative if/when it gets to line 163 of Tika's WordParser
- since before that it loops until runEnd is less than or equal to
currentTextEnd:
    while (runEnd > currentTextEnd) {
        ...
    }
    String str = currentPiece.substring(0, runEnd - currentTextStart);
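
To illustrate how a negative end value turns into the exception in the
stack trace, here is a minimal standalone sketch. It is not the POI or
Tika code: NegativeEndDemo, pieceSubstring and throwsOnRange are
hypothetical names, pieceSubstring only mirrors the substring method
quoted above, and the runEnd / currentTextStart values are made up so
that their difference is -21047. The exact exception message may differ
between Java versions.

```java
public class NegativeEndDemo {

    // Hypothetical stand-in for the TextPiece.substring method quoted above:
    // offsets are halved when the piece is stored as Unicode.
    static String pieceSubstring(StringBuffer buf, boolean usesUnicode, int start, int end) {
        int denominator = usesUnicode ? 2 : 1;
        return buf.substring(start / denominator, end / denominator);
    }

    // Returns true if the call fails with StringIndexOutOfBoundsException.
    static boolean throwsOnRange(StringBuffer buf, boolean usesUnicode, int start, int end) {
        try {
            pieceSubstring(buf, usesUnicode, start, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer("some extracted text");

        // Made-up values: a run that ends before the current piece starts.
        int runEnd = 100;
        int currentTextStart = 21147;

        // The end argument is 100 - 21147 = -21047, so substring rejects it.
        System.out.println(throwsOnRange(buf, false, 0, runEnd - currentTextStart));

        // A non-negative end within the buffer works as expected.
        System.out.println(pieceSubstring(buf, false, 0, 4));
    }
}
```

So any path that reaches that line with runEnd smaller than
currentTextStart will fail in this way, whatever the document contents.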

IMO this is a Tika bug and you should file a bug report (preferably
with an attached example Word document that causes the issue):
    https://issues.apache.org/jira/browse/TIKA

Niall

> String str = currentPiece.substring(0, runEnd - currentTextStart);
>
> Any ideas?
>
> Regards Mats
>
