tika-dev mailing list archives

From "Mats Norén" <mats.no...@gmail.com>
Subject Problem with WordParser
Date Wed, 19 Dec 2007 19:43:18 GMT
Hello,
I've been trying to extract text from a couple of different MS-Word
files and I'm getting mixed results.
Seemingly at random, I get this error:
java.lang.StringIndexOutOfBoundsException: String index out of range: -21047
	at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:886)
	at java.lang.StringBuffer.substring(StringBuffer.java:417)
	at org.apache.poi.hwpf.model.TextPiece.substring(TextPiece.java:88)
	at org.apache.tika.parser.microsoft.WordParser.extractText(WordParser.java:163)

Looking at TextPiece in POI, I can see that the substring method is
called with a negative value for end:

public String substring(int start, int end)
   {
     int denominator = _usesUnicode ? 2 : 1;

     return ((StringBuffer)_buf).substring(start/denominator, end/denominator);
   }

I just can't see why / how runEnd - currentTextStart can end up being
a negative value.

String str = currentPiece.substring(0, runEnd - currentTextStart);
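
For reference, here is a minimal sketch that reproduces the same exception outside of Tika. It assumes only that the end argument arrives negative, which happens whenever runEnd is smaller than currentTextStart. The class name, the sample values and the safeSubstring guard are made up for illustration; they are not part of POI or Tika, and the guard only hides the symptom rather than explaining why the offsets are wrong.

```java
public class NegativeEndDemo {

    // Same arithmetic as the TextPiece.substring method quoted above,
    // written against a plain StringBuffer so it runs without POI.
    public static String pieceSubstring(StringBuffer buf, boolean usesUnicode,
                                        int start, int end) {
        int denominator = usesUnicode ? 2 : 1;
        return buf.substring(start / denominator, end / denominator);
    }

    // Hypothetical defensive variant: clamps both offsets into the buffer
    // and returns an empty string when the range is empty or inverted.
    public static String safeSubstring(StringBuffer buf, boolean usesUnicode,
                                       int start, int end) {
        int denominator = usesUnicode ? 2 : 1;
        int from = Math.max(0, start / denominator);
        int to = Math.min(buf.length(), end / denominator);
        return to <= from ? "" : buf.substring(from, to);
    }

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer("some text");

        // Made-up offsets where the run ends before the piece starts.
        int runEnd = 10;
        int currentTextStart = 30;

        try {
            pieceSubstring(buf, false, 0, runEnd - currentTextStart);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getMessage());
        }

        // The guarded variant returns an empty string instead of throwing.
        System.out.println("[" + safeSubstring(buf, false, 0, runEnd - currentTextStart) + "]");
    }
}
```

So the exception itself follows directly from the negative difference; what remains unclear is why runEnd would be smaller than currentTextStart for these files.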

Any ideas?

Regards Mats
