From: "Bob Paulin (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Date: Sat, 19 Mar 2016 20:23:33 +0000 (UTC)
Subject: [jira] [Updated] (TIKA-1904) Tika 2.0 - Create Proxy Parser and Detectors

     [ https://issues.apache.org/jira/browse/TIKA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Paulin updated TIKA-1904:
-----------------------------
    Description:
There are several parsers and detectors that instantiate parsers and detectors that live in different modules in Tika 2.0.
As of now, these modules depend on other modules, including:

tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing the concept of proxy parsers and detectors. A proxy would enable the functionality if all the dependencies are included in the project, but would not throw a ClassNotFoundException if the required module is not included (e.g. the parse function would do nothing).

Example: currently, ChmParser has
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser = new ParserProxy("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would call the proxy
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

  was:
There are several
parsers and detectors that instantiate parsers and detectors that live in different modules in Tika 2.0. As of now, these modules depend on other modules, including:

tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing the concept of proxy parsers and detectors. A proxy would enable the functionality if all the dependencies are included in the project, but would not throw a ClassNotFoundException if the required module is not included (e.g. the parse function would do nothing).

Example: currently, ChmParser has
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser = new ProxyParser("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would call the proxy
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}


> Tika 2.0 - Create Proxy Parser and Detectors
> --------------------------------------------
>
>                 Key: TIKA-1904
>                 URL: https://issues.apache.org/jira/browse/TIKA-1904
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 2.0
>            Reporter: Bob Paulin
>            Assignee: Bob Paulin
>
> There are several parsers and detectors that instantiate parsers and detectors that live in different modules in Tika 2.0. As of now, these modules depend on other modules, including:
> tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, tika-parser-package-module
> tika-parser-ebook-module -> tika-parser-text-module
> tika-parser-journal-module -> tika-parser-pdf-module
> Many of these dependencies could be made optional by introducing the concept of proxy parsers and detectors. A proxy would enable the functionality if all the dependencies are included in the project, but would not throw a ClassNotFoundException if the required module is not included (e.g. the parse function would do nothing).
> Example: currently, ChmParser has
> {code}
> private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
>     InputStream stream = null;
>     Metadata metadata = new Metadata();
>     HtmlParser htmlParser = new HtmlParser();
>     ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
>     ParseContext parser = new ParseContext();
>     try {
>         stream = new ByteArrayInputStream(byteObject);
>         htmlParser.parse(stream, handler, metadata, parser);
>     } catch (SAXException e) {
>         throw new RuntimeException(e);
>     } catch (IOException e) {
>         // Pushback overflow from tagsoup
>     }
> }
> {code}
> Instead, the HtmlParser could be proxied in the constructor
> {code}
> private final Parser htmlProxyParser;
>
> public ChmParser() {
>     this.htmlProxyParser = new ParserProxy("org.apache.tika.parser.html.HtmlParser");
> }
> {code}
> And parsePage would call the proxy
> {code}
> private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
>     InputStream stream = null;
>     Metadata metadata = new Metadata();
>     ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
>     ParseContext parser = new ParseContext();
>     try {
>         stream = new ByteArrayInputStream(byteObject);
>         htmlProxyParser.parse(stream, handler, metadata, parser);
>     } catch (SAXException e) {
>         throw new RuntimeException(e);
>     } catch (IOException e) {
>         // Pushback overflow from tagsoup
>     }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
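
For illustration only, the proxy idea described in the issue (load the real parser by class name if its module is on the classpath, otherwise do nothing) might look roughly like the sketch below. It is not the Tika implementation. To keep it self-contained, the TextParser interface and UpperCaseParser class are hypothetical stand-ins for Tika's Parser and HtmlParser, and the ParserProxy constructor signature simply mirrors the one-argument form used in the issue text.

```java
import java.util.Locale;

public class ProxySketch {

    /** Hypothetical stand-in for Tika's Parser interface, reduced to one method. */
    public interface TextParser {
        void parse(String input, StringBuilder out);
    }

    /** Hypothetical delegate that is present on the classpath. */
    public static class UpperCaseParser implements TextParser {
        public UpperCaseParser() {
        }

        @Override
        public void parse(String input, StringBuilder out) {
            out.append(input.toUpperCase(Locale.ROOT));
        }
    }

    /** Resolves the named parser reflectively; behaves as a no-op when the class is absent. */
    public static class ParserProxy implements TextParser {
        private final TextParser delegate; // null when the target class is not available

        public ParserProxy(String className) {
            this.delegate = load(className);
        }

        private static TextParser load(String className) {
            try {
                Class<?> clazz = Class.forName(className);
                return (TextParser) clazz.getDeclaredConstructor().newInstance();
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                // The optional module is not on the classpath: fall back to a no-op.
                return null;
            } catch (ReflectiveOperationException e) {
                // The class exists but cannot be instantiated: surface this as a real error.
                throw new IllegalStateException("Cannot instantiate " + className, e);
            }
        }

        public boolean isAvailable() {
            return delegate != null;
        }

        @Override
        public void parse(String input, StringBuilder out) {
            if (delegate != null) {
                delegate.parse(input, out);
            }
            // Otherwise do nothing, as suggested in the issue description.
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        new ParserProxy(UpperCaseParser.class.getName()).parse("chm page", out);
        new ParserProxy("org.example.MissingParser").parse("ignored", out);
        System.out.println(out); // prints "CHM PAGE"
    }
}
```

One design question the sketch leaves open is whether a missing module should be silent or logged; the issue text only says that parse would do nothing.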