tika-dev mailing list archives

From "Dennis Adler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-577) IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures
Date Wed, 22 Dec 2010 04:38:01 GMT

     [ https://issues.apache.org/jira/browse/TIKA-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Adler updated TIKA-577:
------------------------------

    Description: 
When cracking a Word 03 document (which, unfortunately, I cannot upload -- it has client-confidential
data), an index out of bounds exception occurs in the POI code used by the WordExtractor.
To make up for the unavailable doc file, I've included the results of a couple of hours of
stepping through the code to find the failure point. The error occurs because point[0] = point[1]
= 301, while the upper bound of _paragraphs is also 301, so index 301 is past the last valid entry
(index 300 is accepted, as noted further down). This is in the method
org.apache.poi.hwpf.usermodel.Range.getCharacterRun(int).
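
A minimal stand-alone sketch of the failure mode described above. This is not POI code: a plain
java.util.List stands in for _paragraphs, its size of 301 is an assumption based on the values
seen in the debugger, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of POI or Tika.
class OutOfBoundsSketch {

    // Builds a list of `size` dummy entries to play the role of _paragraphs.
    static List<String> makeParagraphs(int size) {
        List<String> paragraphs = new ArrayList<String>();
        for (int i = 0; i < size; i++) {
            paragraphs.add("papx-" + i);
        }
        return paragraphs;
    }

    // Returns true if list.get(index) throws IndexOutOfBoundsException.
    static boolean throwsOnGet(List<String> list, int index) {
        try {
            list.get(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> paragraphs = makeParagraphs(301); // assumed size
        System.out.println("get(300) throws: " + throwsOnGet(paragraphs, 300));
        System.out.println("get(301) throws: " + throwsOnGet(paragraphs, 301));
    }
}
```

With 301 entries the valid indices are 0 through 300, which matches what I see: 300 works and 301 does not.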

The method + line numbers are:

public CharacterRun getCharacterRun(int index)

line 792:	int[] point = findRange(_paragraphs, _parStart, Math.max(chpx.getStart(), _start), chpx.getEnd());
line 794:	PAPX papx = _paragraphs.get(point[0]);  // <<< This is the source of the exception

STACK at time of exception:

Range.getCharacterRun(int) line 794
PicturesTable.getAllPictures() line 191
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
OfficeParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 187
DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 137
... (my project) ...


As noted, this occurs in a Word 2003 doc that has no pictures (its content is a table). The first
pass finds 147 character runs (0 - 146). The problem occurs on that first pass (not sure if there
will be others), on the last run, in this code section from
org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:


  public List<Picture> getAllPictures() {
    ArrayList<Picture> pictures = new ArrayList<Picture>();

    Range range = _document.getOverallRange();
    for (int i = 0; i < range.numCharacterRuns(); i++) {
    	CharacterRun run = range.getCharacterRun(i);

The error occurs on getCharacterRun(i) when i = 146, which is the last run in the range. If I
change point[0] to 300 (in getCharacterRun), the call returns nicely to
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting the List all to an
empty List. It fails again later, on a subsequent call to getAllPictures, with the same error.
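
Until the lookup itself is fixed, a caller could treat the failing picture scan as "no pictures",
which is what the document actually contains. A rough sketch of that idea, not a tested patch for
Tika: the class and method names are hypothetical, and a java.util.function.Supplier stands in
for the call to getAllPictures().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of POI or Tika.
class PictureFallbackSketch {

    // Runs the supplied lookup and falls back to an empty list if it
    // throws IndexOutOfBoundsException. Other exceptions still propagate.
    static <T> List<T> listOrEmpty(Supplier<List<T>> source) {
        try {
            return source.get();
        } catch (IndexOutOfBoundsException e) {
            return Collections.<T>emptyList();
        }
    }

    public static void main(String[] args) {
        // Simulates a picture scan that fails the way getAllPictures() does here.
        List<String> pictures = PictureFallbackSketch.<String>listOrEmpty(() -> {
            List<String> stub = new ArrayList<String>();
            stub.get(301); // out of range, throws
            return stub;
        });
        System.out.println("pictures found: " + pictures.size());
    }
}
```

This only hides the symptom, so text extraction can continue. The real problem is still the bad
index computed in getCharacterRun.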

POTENTIAL FIX: if point[0] >= _paragraphs.size() (i.e. the index is past the last paragraph), then
return an EMPTY CharacterRun for the paragraph in question.
Cannot send repro document - contains confidential client data.
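
The bounds check behind the potential fix could look roughly like this. It is a sketch on a
plain java.util.List rather than a patch against Range.getCharacterRun: the helper name is
hypothetical, and it returns null where the real fix would return an empty CharacterRun.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of POI or Tika.
class GuardedLookupSketch {

    // Returns the entry at `index`, or null when `index` lies outside
    // the valid range 0 .. entries.size() - 1.
    static <T> T getOrNull(List<T> entries, int index) {
        if (index < 0 || index >= entries.size()) {
            return null; // caller substitutes an empty result here
        }
        return entries.get(index);
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("p0", "p1", "p2");
        System.out.println(getOrNull(paragraphs, 2)); // last valid index
        System.out.println(getOrNull(paragraphs, 3)); // one past the end
    }
}
```

Note the comparison is >= rather than >, because an index equal to the list size is already out
of range.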

  was:
When cracking a Word 03 document (which, unfortunately, I cannot upload -- it has client-confidential
data -- an index out of bounds exception occurs in the POI code used by the WordExtractor.
To try to make up for the unavailable doc file, I've included the resutls of a couple of hours
stepping through the code to find the failure point. The error occurs because point[0] = point[1]
= 30; upperbound of _paragraphs = 301. This is in the method org.apache.poi.hwpf.usermodel.CharacterRun()
.

The method + line numbers are:

public CharacterRun getCharacterRun(int index)

line 792:	int[] point = findRange(_paragraphs, _parStart, Math.max(chpx.getStart(), _start),
chpx.getEnd());
line 794:	PAPX papx = _paragraphs.get(point[0]);  // <<< This is the source of the
exception

STACK at time of exception:

Range.GetCharacterRun(nit) line 794
PicturesTable.getAllPictures() line 191
WordExtractor$PicturesSource.<init>(HPWFDocument) line 429
WordExtractor$PicturesSource.<init>(HPWFDocument, WordExtractor#1) line 419
WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
OfficeParser.parse(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext)
line 187
DefaulttParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext)
line 197
AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext)
line 197
AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 137
... (my project) ...


As noted, this occurs in a Word 2003 doc which has no pictures (it is a table); 147 character
runs (0 - 146) found in first pass. Problem occurs on
first pass (not sure if there will be others) on this run. Last run in this code section from
org.apache.poi.hwpf.model.PicturesTable.getAllPictures(),
lines 186-191:


  public List<Picture> getAllPictures() {
    ArrayList<Picture> pictures = new ArrayList<Picture>();

    Range range = _document.getOverallRange();
    for (int i = 0; i < range.numCharacterRuns(); i++) {
    	CharacterRun run = range.getCharacterRun(i);

Error occurs on getCharacterRun(146) -- which is the last run in the range. If I change point[0]
to 300, the call returns nicely to 
WordExtractor$PicturesSource.<init>(HPWFDocument) line 429, setting <all> to an
empty list. Fails again later on subsequent call to
getAllPictures with same error.

POTENTIAL FIX: if point[0] > papx.Length then return an EMPTY CharacterRun for the paragraph
in question.
Cannot send repro document - contains confidential client data.


Fix typos

> IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-577
>                 URL: https://issues.apache.org/jira/browse/TIKA-577
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
>            Reporter: Dennis Adler
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

