tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
Date Tue, 01 Nov 2016 19:38:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626453#comment-15626453 ]

Hudson commented on TIKA-2098:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x #169 (See [https://builds.apache.org/job/tika-2.x/169/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2098
>                 URL: https://issues.apache.org/jira/browse/TIKA-2098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Alexander Kazakov
>            Assignee: Tim Allison
>              Labels: java, parser, pdf
>             Fix For: 2.0, 1.14
>
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata metadata, int maxLength) and maxLength < content size, it throws an exception.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
> 	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
> 	at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
> 	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
> 	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
> 	... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
> 	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
> 	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
> 	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
> 	... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	... 51 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	... 52 more
> {code}
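
The trace above shows the write-limit signal (WriteLimitReachedException) buried under several wrapper exceptions before it reaches Tika.parseToString(). The following is a minimal, dependency-free sketch of that pattern and of one way a caller could cope with it, by walking the cause chain. It is not Tika's code and not necessarily how TIKA-2098 was fixed: every class and method name below (WriteLimitDemo, LimitedSink, the local WriteLimitReachedException, parse, isWriteLimitReached) is a hypothetical stand-in, and the sketch assumes maxLength >= 0.

```java
import java.io.IOException;

public class WriteLimitDemo {

    /** Hypothetical stand-in for a "write limit reached" signal. */
    static class WriteLimitReachedException extends RuntimeException {
        WriteLimitReachedException(int limit) {
            super("Your document contained more than " + limit + " characters");
        }
    }

    /** Collects characters until the limit is hit, then throws. Assumes limit >= 0. */
    static class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        void characters(String s) {
            int room = limit - text.length();
            if (s.length() > room) {
                // Keep the text up to the limit, then signal that the limit was reached.
                text.append(s, 0, room);
                throw new WriteLimitReachedException(limit);
            }
            text.append(s);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Simulates a parser layer that wraps any failure in another exception. */
    static void parse(String content, LimitedSink sink) throws IOException {
        try {
            sink.characters(content);
        } catch (RuntimeException e) {
            throw new IOException("Unable to write a string: " + content, e);
        }
    }

    /** Returns true if the limit signal appears anywhere in the cause chain. */
    static boolean isWriteLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof WriteLimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters; rethrows anything that is not the limit signal. */
    public static String parseToString(String content, int maxLength) throws IOException {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            parse(content, sink);
        } catch (IOException e) {
            if (!isWriteLimitReached(e)) {
                throw e;
            }
            // Limit reached: the truncated text is the expected result, not an error.
        }
        return sink.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(parseToString("Tika - Content Analysis Toolkit", 10));
    }
}
```

Checking only the outermost exception type would miss the signal once a parser wraps it, which is consistent with the wrapping visible in the trace. Inspecting the whole cause chain is one plausible way to recognise it regardless of how many layers were added.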



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)