tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-434) Bug in TagSoup causes IOException
Date Mon, 07 Jun 2010 22:26:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876454#action_12876454 ]

Jukka Zitting commented on TIKA-434:
------------------------------------

I came up with a fairly simple patch [1] that seems to solve this. I'll see what we can do
to push out an official release with this fix.

[1] http://github.com/jukka/tagsoup/commit/9cfe7b48745173faafa419f540538a0b6309b699

> Bug in TagSoup causes IOException
> ---------------------------------
>
>                 Key: TIKA-434
>                 URL: https://issues.apache.org/jira/browse/TIKA-434
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Andrew Khoury
>         Attachments: html_to_reproduce_issue.html
>
>
> When uploading documents to a Jackrabbit 2.1 repository, the following exception was received. It looks like a bug in TagSoup 1.2 (if you search the TagSoup Yahoo group you can see that it may be caused by '&' characters in the HTML being parsed):
> 27.05.2010 14:57:18 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.HtmlParser@eba477
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>        at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>        at java.util.concurrent.FutureTask.run(Unknown Source)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
>        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>        at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: Pushback buffer overflow
>        at java.io.PushbackReader.unread(Unknown Source)
>        at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
>        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
>        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:177)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>        ... 10 more
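
For context, the "Pushback buffer overflow" at the bottom of the trace is the message java.io.PushbackReader.unread raises when more characters are pushed back than its buffer can hold. The sketch below reproduces only that JDK behaviour in isolation. It is not TagSoup's scanner code and not the patch linked above; the class name, the helper method, and the buffer size of 2 are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical demo class, not part of TagSoup or Tika.
class PushbackOverflowDemo {

    // Tries to push back one more character than the buffer can hold and
    // returns the message of the resulting IOException (null if none is thrown).
    static String overflowMessage(int bufferSize) {
        try (PushbackReader reader =
                 new PushbackReader(new StringReader("a&b"), bufferSize)) {
            // The pushback buffer holds bufferSize chars; this asks for one more.
            reader.unread(new char[bufferSize + 1]);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(overflowMessage(2));
    }
}
```

A scanner that calls unread more times than its PushbackReader was sized for would fail the same way, which would be consistent with the HTMLScanner.unread frame in the trace above.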

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

