tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2191) Apply current .docx unit tests to experimental SAX parser and fix or document as necessary
Date Tue, 06 Dec 2016 14:47:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725699#comment-15725699 ]

Hudson commented on TIKA-2191:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1150 (See [https://builds.apache.org/job/Tika-trunk/1150/])
TIKA-2191 -- step1 -- add other docx tests and comment/ignore where (tallison: rev 894301307da5167c95585688f9448d3050f53aaa)
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-sax-docx.xml
* (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (delete) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
TIKA-2191 -- step2 -- add handling for docm files...extract macros (tallison: rev f93d4e1fffdb4a441f7fa750a43691adfa70c392)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
TIKA-2191 -- step 3 -- clean up <b> and <i> tag handling (tallison: rev 1aca10a26dada02a045a1bc9eb7c3cfc1b36a83e)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
TIKA-2191 -- step 4-- add markup for embedded pics (tallison: rev 806eaf8b1802a3a3071a5ae0bdc35c20d6739280)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
TIKA-2191 -- step 5 actually extract images embedded in areas besides (tallison: rev 4469ca2c4ea725e9f5d94c116aaf248deea2a6eb)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (add) tika-parsers/src/test/resources/test-documents/testWORD_embedded_pics.docx
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
update changes for TIKA-2191 and TIKA-2192 (tallison: rev 5425d02a1ed97ce5f884a076f55ad8197cc6ac7b)
* (edit) CHANGES.txt


> Apply current .docx unit tests to experimental SAX parser and fix or document as necessary
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2191
>                 URL: https://issues.apache.org/jira/browse/TIKA-2191
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> There are many areas for clean up to ensure that the new SAX .docx parser yields similar
> results to the legacy DOM .docx parser.  Let's use this issue to track work on improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
