tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1822) NullPointerException when parsing a .doc file
Date Tue, 05 Jan 2016 14:30:39 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083128#comment-15083128 ]

Tim Allison commented on TIKA-1822:
-----------------------------------

When we can't get the ID for a linked object via POI's {{CharacterRun mscr = field.getMarkSeparatorCharacterRun(r);}},
should we add an annotation for an unknown id (e.g. {{<div class="embedded" id="_UNKNOWN_ID" />}})
or should we skip adding an annotation?
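
To make the two options concrete, here is a minimal sketch that does not use Tika or POI at all. The class name {{EmbeddedIdPolicy}} and both method names are hypothetical, and the nullable {{String id}} parameter stands in for whatever ID could (or could not) be recovered from the character run. It is only meant to show how the two behaviours differ for a missing ID, not how WordExtractor is or should be written.

```java
import java.util.Optional;

public class EmbeddedIdPolicy {
    // Placeholder taken from the example in the comment above.
    public static final String UNKNOWN_ID = "_UNKNOWN_ID";

    /** Option A: always emit an annotation, substituting a placeholder for a missing ID. */
    public static Optional<String> annotateWithPlaceholder(String id) {
        String safeId = (id == null || id.isEmpty()) ? UNKNOWN_ID : id;
        return Optional.of(div(safeId));
    }

    /** Option B: emit nothing when the ID could not be determined. */
    public static Optional<String> annotateOrSkip(String id) {
        if (id == null || id.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(div(id));
    }

    // Builds the annotation markup; the ID is assumed to need no XML escaping here.
    private static String div(String id) {
        return "<div class=\"embedded\" id=\"" + id + "\"/>";
    }

    public static void main(String[] args) {
        // A null ID models the case where the lookup returned nothing usable.
        System.out.println(annotateWithPlaceholder(null).orElse("(nothing emitted)"));
        System.out.println(annotateOrSkip(null).orElse("(nothing emitted)"));
        System.out.println(annotateOrSkip("_1234").orElse("(nothing emitted)"));
    }
}
```

With option A, downstream consumers always see one annotation per linked object but have to recognise the placeholder; with option B, the output contains only usable IDs but the presence of the linked object is lost.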


> NullPointerException when parsing a .doc file
> ---------------------------------------------
>
>                 Key: TIKA-1822
>                 URL: https://issues.apache.org/jira/browse/TIKA-1822
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>         Environment: Linux
>            Reporter: Panagiotis Mpailis
>            Assignee: Tim Allison
>         Attachments: npe_example.doc
>
>
> We are using Tika 1.11 to extract text from MS Word documents, and a few errors occur when processing some docs.
> This ticket relates to https://issues.apache.org/jira/browse/TIKA-1733; however, in this case there is an unexpected NullPointerException rather than a clear indication of the error.
> Processing a saved copy of the document solves the error altogether. A difference found between the two documents was that _(HWPFDocument)document.getRange()_ returned different values.
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@58a306e2
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.Tika.parseToString(Tika.java:496)
> 	at org.apache.tika.Tika.parseToString(Tika.java:610)
> Caused by: java.lang.NullPointerException
> 	at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:311)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:169)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 10 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
