tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding
Date Thu, 09 Aug 2012 22:13:19 GMT

    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432218#comment-13432218 ]

Ken Krugler commented on TIKA-881:

I've asked Jukka to look into this. From my email to tika-dev:

The fix that Klaus provided avoids using reset() on the input stream.

But I thought that Tika tries to wrap streams such that a reset() will work properly, as otherwise
auto detection of content can fail.

I haven't had to dig into all of the tricky issues around stream management, so I'm hoping
you can take a look at Klaus's report and provide commentary.
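
One general way to avoid reset() can be sketched as follows. This is not Klaus's attached patch (its contents are not shown in this message) and not Tika's code: the class and method names (PrefixPeek, readPrefix, rejoin) are hypothetical. The sketch copies a bounded prefix into a byte array for sniffing, then chains that prefix in front of the rest of the stream, so nothing depends on mark/reset support.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
public class PrefixPeek {

    /** Reads up to limit bytes from in, stopping early only at end of stream. */
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Returns a stream that yields the consumed prefix first, then the remainder. */
    static InputStream rejoin(byte[] prefix, InputStream rest) {
        return new SequenceInputStream(new ByteArrayInputStream(prefix), rest);
    }

    /** Peeks at the first few bytes, then reads the whole document back as text. */
    static String peekThenReadAll(String document, int peekBytes) throws IOException {
        InputStream in = new ByteArrayInputStream(document.getBytes(StandardCharsets.ISO_8859_1));
        byte[] prefix = readPrefix(in, peekBytes); // an encoding sniffer would inspect this
        InputStream full = rejoin(prefix, in);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = full.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(peekThenReadAll("<html><head></head></html>", 8));
    }
}
```

The trade-off is one extra copy of the prefix bytes; in exchange, the caller controls exactly how many bytes are consumed before the stream is handed on.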
> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>                 Key: TIKA-881
>                 URL: https://issues.apache.org/jira/browse/TIKA-881
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows7, JDK1.5, JDK1.6
>            Reporter: Klaus v. Einem
>            Assignee: Ken Krugler
>              Labels: stability
>         Attachments: BugfixHtmlParser.java, HtmlParser.java
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
> Attention, this is an ugly one: a non-deterministic bug. It fails only about 1 time in 10.

> java.io.IOException: Resetting to invalid mark
> 	at java.io.BufferedInputStream.reset(Unknown Source)
> 	at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
> 	at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
> 	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding() method: so that the input stream can be read again, the current read
> position is marked and a readlimit (the maximum number of bytes that can be read before the
> mark position is invalidated) is given.
> So far so good, but then an InputStreamReader comes into play. When you check the API doc
> you see this:
>  * ...
>  * To enable the efficient conversion of bytes to characters, more bytes may
>  * be read ahead from the underlying stream than are necessary to satisfy the
>  * current read operation.
>  * ...
> Please notice the term "may"... When this read-ahead happens, the subsequent reset() on the
> stream throws the exception because the mark position has been invalidated (the number of
> bytes read exceeds the readlimit).
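
The failure described above can be reproduced outside Tika with a minimal sketch. It assumes a deliberately tiny BufferedInputStream buffer so that the read-ahead invalidates the mark on every run; the class name, method name, and sizes are illustrative and are not taken from HtmlParser.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative demo, not Tika code.
public class MarkResetDemo {

    /**
     * Marks the stream, reads a single character through an InputStreamReader,
     * then tries to reset. Returns true if reset() throws an IOException.
     */
    static boolean resetFailsAfterReader(int bufferSize, int readLimit, int dataSize)
            throws IOException {
        byte[] data = new byte[dataSize];
        Arrays.fill(data, (byte) 'a');
        InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(data), bufferSize);

        stream.mark(readLimit);
        Reader reader = new InputStreamReader(stream, StandardCharsets.US_ASCII);
        reader.read(); // asks for one char; the reader may pull many more bytes

        try {
            stream.reset();
            return false; // the mark survived the read-ahead
        } catch (IOException e) {
            return true;  // the read-ahead went past the readlimit and dropped the mark
        }
    }

    public static void main(String[] args) throws IOException {
        // 16-byte buffer, readlimit of 8 bytes, 100 bytes of input.
        System.out.println("reset failed: " + resetFailsAfterReader(16, 8, 100));
    }
}
```

Whether the mark is actually lost depends on the buffer size and on how many bytes the underlying stream hands over per read, which would be consistent with the intermittent behaviour reported here.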


