tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2151) Imposed Write Limit Causes Lost Data With Pdfs
Date Tue, 01 Nov 2016 18:08:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626194#comment-15626194 ]

Tim Allison commented on TIKA-2151:
-----------------------------------

I think this may be a duplicate of TIKA-2098.  The fix will be in Tika 1.14, which should
be out towards the end of the week.

I just improved the unit test for TIKA-2098 to be:

{noformat}
    @Test
    public void testMaxLength() throws Exception {
        InputStream is = getResourceAsStream("/test-documents/testPDF.pdf");
        String content = new Tika().parseToString(is, new Metadata(), 100);
        assertEquals(100, content.length());
        assertContains("Tika - Content", content);
    }
{noformat}

> Imposed Write Limit Causes Lost Data With Pdfs
> ----------------------------------------------
>
>                 Key: TIKA-2151
>                 URL: https://issues.apache.org/jira/browse/TIKA-2151
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>            Reporter: Josh Cummings
>            Priority: Critical
>
> When we upgraded to 1.13, we noticed a new exception in our logs:
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:184)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:144)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.Tika.parseToString(Tika.java:527)
> 	at org.apache.tika.Tika.parseToString(Tika.java:602)
> 	at com.attask.tika.WriteLimitAllCatchTikaTest.testStillNeedOverride(WriteLimitAllCatchTikaTest.java:31)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> 	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
> 	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
> 	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
> 	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string:   One will of mine to make thy large will more. 
> 	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:500)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
> 	at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
> 	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:214)
> 	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> 	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:160)
> 	... 33 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
> 	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
> 	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
> 	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:498)
> 	... 41 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	... 49 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> 	at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	... 50 more
> This appears to happen because the top-level exception is not a SAXException, so
> Tika#parseToString doesn't catch it and never checks whether its root cause is a
> WriteLimitReachedException. As a result, the first 100000 parsed characters are not
> returned.
> Here is a quick repro code block:
> Tika tika = new Tika();
> InputStream is = this.getClass().getClassLoader().getResourceAsStream("pg100.pdf");
> try {
> 	String s = tika.parseToString(is);
> 	System.out.println("It works!");
> } catch ( Exception e ) {
> 	System.out.println("Tika missed the WriteLimitReachedException");
> }
> The PDF used here is one with more than 100000 parseable characters in it.
> Not sure I understand all the ins and outs, but we fixed it by extending Tika.java and
> overriding Tika#parseToString to catch Exception instead of SAXException.
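
The root-cause check behind that workaround can be sketched without Tika on the classpath. This is only an illustration of the idea, not Tika's code or the reporter's actual override: WriteLimitReached, ParseStep, and CauseChain are hypothetical stand-ins, and the sketch assumes the partial text has already been written to a buffer by the time the limit is hit.

```java
/** Hypothetical stand-in for Tika's WriteLimitReachedException; not the real class. */
class WriteLimitReached extends RuntimeException {
    WriteLimitReached(String message) { super(message); }
}

/** Hypothetical parse step that appends text to a buffer and may fail part-way. */
interface ParseStep {
    void run(StringBuilder out) throws Exception;
}

final class CauseChain {
    private CauseChain() {}

    /** Returns true if t, or any exception in its cause chain, is an instance of type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Catches any Exception (not only one wrapper type), and keeps the text
     * collected so far when the write limit is found anywhere in the cause chain.
     * Unrelated failures are rethrown, wrapped in an unchecked exception.
     */
    static String textOrPartial(ParseStep step) {
        StringBuilder out = new StringBuilder();
        try {
            step.run(out);
        } catch (Exception e) {
            if (!hasCause(e, WriteLimitReached.class)) {
                throw new IllegalStateException("parse failed", e);
            }
            // Limit reached: fall through and return what was written.
        }
        return out.toString();
    }
}
```

Walking the whole cause chain, rather than matching on the type of the outermost exception, is what lets the check survive extra wrapping layers like the TikaException and IOExceptionWithCause seen in the trace above.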



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
