Date: Wed, 17 Aug 2011 15:10:29 +0000 (UTC)
From: "Markus Jelsma (JIRA)"
To: dev@tika.apache.org
Message-ID: <1017454850.45438.1313593829527.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1114519773.18000.1308400787324.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (TIKA-676) Boilerpipe fails

[ 
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086370#comment-13086370 ]

Markus Jelsma commented on TIKA-676:
------------------------------------

Is this going to be integrated with Tika 1.0? Is the BP 1.2.0 artifact going to be published?

> Boilerpipe fails
> ----------------
>
>                 Key: TIKA-676
>                 URL: https://issues.apache.org/jira/browse/TIKA-676
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.0
>
>
> This is apparently a [boilerpipe issue |http://code.google.com/p/boilerpipe/issues/detail?id=24 ], they fixed in the [Web API edition | http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
> 	at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
> 	at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
> 	at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
> 	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
> 	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
> 	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
> 	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
> 	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
> 	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
> 	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> 	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
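
For anyone trying to see what "SAX input contains nested A elements" refers to, here is a rough sketch of that kind of check using only the JDK's SAX API. It is not Boilerpipe's or Tika's actual code: the class `NestedAnchorCheck` and the method `hasNestedAnchors` are made-up names for illustration, and the sketch assumes well-formed XHTML input, whereas in the trace above the SAX events come from TagSoup parsing real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika or Boilerpipe.
class NestedAnchorCheck {

    /** Returns true if the well-formed XHTML fragment contains an <a> inside another <a>. */
    static boolean hasNestedAnchors(String xhtml) throws Exception {
        final boolean[] nested = {false};
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // number of <a> elements currently open

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(qName)) {
                    if (depth > 0) {
                        nested[0] = true; // an <a> started while another was still open
                    }
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a".equalsIgnoreCase(qName)) {
                    depth--;
                }
            }
        };
        // The default factory is not namespace-aware, so qName carries the tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return nested[0];
    }

    public static void main(String[] args) throws Exception {
        // prints true: the second <a> opens before the first one closes
        System.out.println(hasNestedAnchors("<p><a href=\"x\">outer <a href=\"y\">inner</a></a></p>"));
        // prints false: the two links are siblings
        System.out.println(hasNestedAnchors("<p><a href=\"x\">one</a> <a href=\"y\">two</a></p>"));
    }
}
```

The sketch only reports the condition; judging by the exception message in the trace, Boilerpipe treats it as fatal and throws a `SAXException` instead.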