tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2191) Apply current .docx unit tests to experimental SAX parser and fix or document as necessary
Date Wed, 14 Dec 2016 18:18:58 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2191:
------------------------------
    Attachment: element_counts_ooxml-docx.xlsx

I counted the elements in the main story .xml file (mostly document.xml) in ~150k doc[xm]
files in our regression corpus.  I optimized the if/else branching in startElement and endElement
to test for the most common elements earlier.
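
For anyone who wants to reproduce this kind of count on their own files, here is a minimal sketch of the idea. It is not the code used for the attached spreadsheet or anything in Tika; the class name ElementCounter and its methods are hypothetical. It tallies namespace-qualified element names with a plain SAX handler and lists them most common first, which is the ordering you could then use to decide which element names to test first in an if/else chain.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): counts elements by {namespace}localName.
public class ElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Key on the namespace URI as well, so unusual namespaces show up separately.
        counts.merge("{" + uri + "}" + localName, 1, Integer::sum);
    }

    // Entries sorted by descending count; ties are broken alphabetically by name.
    public List<Map.Entry<String, Integer>> mostCommonFirst() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    // Parses one XML string (e.g. the contents of document.xml) and returns the counter.
    public static ElementCounter count(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        ElementCounter counter = new ElementCounter();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), counter);
        return counter;
    }
}
```

To aggregate over a corpus, one would presumably reuse a single counter across files; the sketch handles one document at a time to stay short.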

There are a few other interesting things in these stats, including rare "dev" namespaces
like {{http://schemas.openxmlformats.org/wordprocessingml/2006/2/main}}



> Apply current .docx unit tests to experimental SAX parser and fix or document as necessary
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2191
>                 URL: https://issues.apache.org/jira/browse/TIKA-2191
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: element_counts_ooxml-docx.xlsx
>
>
> There are many areas for cleanup to ensure that the new SAX .docx parser yields similar
> results to the legacy DOM .docx parser.  Let's use this issue to track work on improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
