tika-dev mailing list archives

From "Caleb Ott (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2787) Make WriteLimitReachedException public and not subclass of SAXException
Date Thu, 27 Dec 2018 16:31:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729706#comment-16729706 ]

Caleb Ott edited comment on TIKA-2787 at 12/27/18 4:30 PM:
-----------------------------------------------------------

[~dgoldenberg123] I agree with what you are saying here. I have run into similar issues, which
I noted in this ticket: https://issues.apache.org/jira/browse/TIKA-2627.

A slightly nicer workaround is to use the "isWriteLimitReached" method on WriteOutContentHandler.
See the updated code.
{code:java}
WriteOutContentHandler writer = new WriteOutContentHandler(limit); // <-- e.g. set to 1000000
ContentHandler handler = new BodyContentHandler(writer); 
try {
    parser.parse(dataStream, handler, metadata, parseCtx);
} catch (Exception ex) {
    // Write limit exception could be wrapped in a TikaException
    if (!writer.isWriteLimitReached(ex)) {
        throw ex;
    } else {
        log.warn("TE limit reached on file {}.", filePath);
    }
}

// Keep the extracted text regardless of WriteLimitReachedException
String text = handler.toString();

{code}
 

Edit: The more I think about it, the less sure I am that making the exception public would help
much. The exception usually ends up wrapped as the cause of other SAX and Tika exceptions, so
a caller could not reliably catch it directly. "isWriteLimitReached" recursively checks the
exception's causes for the write limit exception, which will probably work better than making
the exception public.
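
To illustrate the cause-chain check described above, here is a minimal, self-contained sketch. It does not use Tika: LimitReachedException is a hypothetical stand-in for the private write limit exception, and isLimitReached is an illustrative helper, not Tika's actual implementation of "isWriteLimitReached", which may differ in detail.

```java
import java.io.IOException;

final class WriteLimitCauseCheck {

    // Hypothetical stand-in for Tika's private write limit exception.
    static final class LimitReachedException extends Exception {
        LimitReachedException(String message) {
            super(message);
        }
    }

    // Walks the cause chain and reports whether any link is the limit exception.
    static boolean isLimitReached(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 100; depth++) {
            if (t instanceof LimitReachedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        // The limit exception buried two levels deep, as it might be after wrapping.
        Throwable wrapped = new RuntimeException("parse failed",
                new IOException("stream error", new LimitReachedException("limit hit")));
        // A failure that has nothing to do with the write limit.
        Throwable unrelated = new RuntimeException("parse failed", new IOException("disk error"));

        System.out.println(isLimitReached(wrapped));   // true
        System.out.println(isLimitReached(unrelated)); // false
    }
}
```

A plain "catch" of the limit exception would miss the first case, because the exception that reaches the caller is the outer wrapper; only inspecting the causes finds it.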

 



> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2787
>                 URL: https://issues.apache.org/jira/browse/TIKA-2787
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.19.1
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> The idea behind being able to set a limit on text extraction is to be able to get back up
> to N characters of extracted text. We just got tripped up by the fact that Tika throws an
> exception once the limit has been reached.
> This, in and of itself, is not a major hindrance, especially since the error message itself
> clearly states that the extracted text is, "however, available".
> OK, but why is WriteLimitReachedException private? Why not public, so it can be explicitly
> caught when the parse() method is called? And why not add it to the signature of the parse
> method? I don't think it should extend SAXException, either; just cleanly throw it as is.
> Right now, our code makes this cumbersome adjustment around the condition:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
