tika-dev mailing list archives

From "Dmitry Goldenberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2627) Exception thrown when max string length is reached
Date Thu, 13 Sep 2018 17:16:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613814#comment-16613814 ]

Dmitry Goldenberg commented on TIKA-2627:
-----------------------------------------

I agree, there is something wrong here for sure. The whole point is to just drop any excess
text.

 
{code:java}
// In org/apache/tika/sax/WriteOutContentHandler
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            super.characters(ch, start, length);
            writeCount += length;
        } else {
            super.characters(ch, start, writeLimit - writeCount);
            writeCount = writeLimit;
            throw new WriteLimitReachedException(
                    "Your document contained more than " + writeLimit
                    + " characters, and so your requested limit has been"
                    + " reached. To receive the full text of the document,"
                    + " increase your limit. (Text up to the limit is"
                    + " however available).", tag);
        }
    }
{code}
This should not throw; at most, it should log a warning and keep going.
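
To illustrate the behavior I have in mind, here is a minimal sketch of a handler that keeps text up to the limit and silently drops the rest. It is not Tika code: the class name TruncatingContentHandler and the isTruncated() flag are hypothetical, and it extends the JDK's DefaultHandler rather than Tika's ContentHandlerDecorator so that it compiles on its own.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika: collects text up to a limit and
// ignores everything after it instead of throwing.
class TruncatingContentHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // -1 means unlimited, as in the code above
    private boolean truncated = false;

    TruncatingContentHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (writeLimit == -1) {
            text.append(ch, start, length);
            return;
        }
        int remaining = writeLimit - text.length();
        if (length <= remaining) {
            text.append(ch, start, length);
        } else {
            // Keep only what still fits, then record that text was dropped.
            if (remaining > 0) {
                text.append(ch, start, remaining);
            }
            truncated = true;
        }
    }

    // Lets the caller log a warning after the parse instead of catching an exception.
    boolean isTruncated() {
        return truncated;
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

One trade-off with this approach: the parser keeps running through the rest of the document even though nothing more is stored, so a very large file still costs the full parse time.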

> Exception thrown when max string length is reached
> --------------------------------------------------
>
>                 Key: TIKA-2627
>                 URL: https://issues.apache.org/jira/browse/TIKA-2627
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Windows 2012 R2
> Java 1.8.0_151
>            Reporter: Caleb Ott
>            Priority: Major
>         Attachments: ExceptionStacktrace.txt
>
>
> I have set the max string length and expected tika to parse up to that limit then return
> me the text. However, for certain files it appears that once that limit is reached, instead
> of returning the text parsed so far, it is throwing an exception.
> It looks like the WriteLimitReachedException is being wrapped in another exception which
> is why it is not being caught.
> Attached is the stack trace I am getting.
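
If the limit exception really is being wrapped, a caller-side workaround might be to walk the cause chain rather than catch the exception type directly. The sketch below is an assumption about how that could look, using only java.lang.Throwable; LimitReachedException and Causes are hypothetical stand-ins, not Tika classes.

```java
// Hypothetical stand-in for the limit exception; not a Tika class.
class LimitReachedException extends RuntimeException {
    LimitReachedException(String message) {
        super(message);
    }
}

// Hypothetical helper: checks whether any exception in the cause chain
// is an instance of the given type.
final class Causes {
    private Causes() {
    }

    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would catch the outer exception, call Causes.hasCause(e, LimitReachedException.class), and treat a match as "limit reached, use the text collected so far" instead of as a parse failure.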



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
