tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1321) Add experimental SAX/Streaming XWPF/docx extractor
Date Wed, 30 Nov 2016 22:47:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15710096#comment-15710096 ]

Hudson commented on TIKA-1321:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1147 (See [https://builds.apache.org/job/Tika-trunk/1147/])
TIKA-1321 -- add SAX based docx parser and integrate it with the recent (tallison: rev d19e4725ff0549597f9156bb0c1e7759f6ce08d9)
* (add) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.docx
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BinaryDataHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/AbstractPartHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/ExtendedPropertiesHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BinaryDataHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/MSOfficeParserConfig.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/PartHandler.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (delete) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParserTest.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/RelationshipsHandler.java
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
* (edit) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.xml
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsManager.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BodyPartHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/RelationshipsManager.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Relationship.java
* (edit) tika-parsers/src/test/resources/test-documents/testWORD_2003ml.xml
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/PartHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/CorePropertiesHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BodyContentHandler.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
* (edit) tika-core/src/main/java/org/apache/tika/utils/DateUtils.java
* (delete) tika-parsers/src/test/resources/test-documents/testWORD_2006ml_src.docx
* (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Relationship.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ExtendedPropertiesHandler.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFRunProperties.java
* (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParser.java
* (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/CorePropertiesHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java


> Add experimental SAX/Streaming XWPF/docx extractor
> --------------------------------------------------
>
>                 Key: TIKA-1321
>                 URL: https://issues.apache.org/jira/browse/TIKA-1321
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>
> I'd like to contribute an experimental streaming extractor for docx.  I should have something
> ready for committing in a few weeks.  I'll attach drafts as they're ready.
> At least for a couple of releases, I'd like to keep it in o.a.t.parser.microsoft.ooxml.experimental
> if that makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
