tika-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-676) Boilerpipe fails
Date Tue, 08 Oct 2013 21:53:42 GMT

    [ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789746#comment-13789746 ]

Markus Jelsma commented on TIKA-676:
------------------------------------

Oh, I checked. None of my open issues are directly related to Boilerpipe; they only concern
HTML5 handling that should be fixed in TagSoup instead. I've submitted additions to TagSoup's
html.tssl, but they're not likely to be incorporated any time soon.

> Boilerpipe fails
> ----------------
>
>                 Key: TIKA-676
>                 URL: https://issues.apache.org/jira/browse/TIKA-676
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
> Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 	at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
> 	at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
> 	at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
> 	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
> 	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
> 	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
> 	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
> 	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
> 	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
> 	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> 	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}
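
The exception above asks for the HTML to be cleaned before it reaches boilerpipe. One possible
way to do that, sketched here under assumptions, is a SAX filter that swallows any <a> element
opened inside another <a> while still passing its text through. NestedAnchorFilter and
maxAnchorDepth are hypothetical names, not part of Tika, TagSoup, or boilerpipe, and the sketch
uses only the JDK's org.xml.sax classes. In Tika the events arrive through a ContentHandler
chain rather than an XMLReader, so the same depth counter would have to live in a handler
decorator instead; that wiring is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: drops start/end events of an <a> nested inside another <a>. */
public class NestedAnchorFilter extends XMLFilterImpl {
    // Number of <a> elements currently open in the input stream.
    private int anchorDepth = 0;

    public NestedAnchorFilter(XMLReader parent) {
        super(parent);
    }

    // Matches both namespace-aware (localName) and plain (qName) parsers.
    private static boolean isAnchor(String localName, String qName) {
        return "a".equalsIgnoreCase(localName) || "a".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (isAnchor(localName, qName)) {
            anchorDepth++;
            if (anchorDepth > 1) {
                return; // swallow the nested <a>; its character data still passes through
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (isAnchor(localName, qName)) {
            boolean nested = anchorDepth > 1;
            anchorDepth--;
            if (nested) {
                return; // matching end tag of a swallowed nested <a>
            }
        }
        super.endElement(uri, localName, qName);
    }

    /** Demo helper: deepest <a> nesting seen downstream of the filter. */
    public static int maxAnchorDepth(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NestedAnchorFilter filter = new NestedAnchorFilter(reader);
        final int[] depth = {0, 0}; // depth[0] = current, depth[1] = maximum
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                if (isAnchor(l, q)) {
                    depth[1] = Math.max(depth[1], ++depth[0]);
                }
            }

            @Override
            public void endElement(String u, String l, String q) {
                if (isAnchor(l, q)) {
                    depth[0]--;
                }
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return depth[1];
    }
}
```

Calling NestedAnchorFilter.maxAnchorDepth on well-formed markup that nests one <a> inside
another reports a depth of 1, since the downstream handler never sees the inner element. This
only hides the symptom; the nested anchors are still produced by the HTML parser upstream.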



--
This message was sent by Atlassian JIRA
(v6.1#6144)
