tika-dev mailing list archives

From "David Cole (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1178) Improve docx multiple section handling - headers and footers
Date Thu, 24 Oct 2013 06:20:02 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803893#comment-13803893 ]

David Cole commented on TIKA-1178:
----------------------------------

This seems easier than anticipated. WordprocessingML defines a lastRenderedPageBreak
element, which "specifies that this position delimited the end of a page when this document
was last saved by an application which paginates its content" (page 328, ECMA-376, 4th
Edition, Office Open XML File Formats: Fundamentals and Markup Language Reference).

You would need some sort of lookahead to find section breaks. Within each section, every
lastRenderedPageBreak tells you where to place the footer for the page that just ended and
the header for the page that follows. Which header and footer to use depends on the section
properties: does the section define a different first page, or different odd and even pages?
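
As a rough sketch of that selection step (not Tika code; the class, method, and parameter
names are all hypothetical), the choice of header/footer variant per page could look like the
following. It assumes the pages of a section have already been split at lastRenderedPageBreak,
and that the two flags have been read from the document: as far as I read the spec,
"different first page" corresponds to w:titlePg in the sectPr and the odd/even switch to
w:evenAndOddHeaders in settings.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeaderFooterSketch {

    // Picks the header/footer reference type for one page. The reference types in
    // WordprocessingML are "first", "even" and "default"; "default" also serves
    // as the odd-page variant when odd/even headers are enabled.
    static String variantFor(int pageInSection, int pageNumber,
                             boolean differentFirstPage, boolean evenAndOdd) {
        if (differentFirstPage && pageInSection == 1) {
            return "first";
        }
        if (evenAndOdd && pageNumber % 2 == 0) {
            return "even";
        }
        return "default";
    }

    // Wraps each buffered page of a section with placeholder header/footer markers.
    // A real implementation would emit the text of the referenced header/footer part.
    static List<String> wrapSection(List<List<String>> pages, int firstPageNumber,
                                    boolean differentFirstPage, boolean evenAndOdd) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < pages.size(); i++) {
            String variant = variantFor(i + 1, firstPageNumber + i,
                                        differentFirstPage, evenAndOdd);
            out.add("[header:" + variant + "]");
            out.addAll(pages.get(i));
            out.add("[footer:" + variant + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> pages = new ArrayList<List<String>>();
        pages.add(Arrays.asList("First paragraph on page 1", "Second paragraph on page 1"));
        pages.add(Arrays.asList("First paragraph on page 2", "Second paragraph on page 2"));
        pages.add(Arrays.asList("First paragraph on page 3", "Second paragraph on page 3"));
        for (String line : wrapSection(pages, 1, true, true)) {
            System.out.println(line);
        }
    }
}
```

With one section, a different first page, and odd/even enabled, the three pages get the
"first", "even", and "default" variants in that order, which is the ordering expected for the
single-section example document.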

If you open the example documents in 7-Zip, under word/document.xml you can see that the
lastRenderedPageBreak element accurately identifies the page breaks.
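
The same check can be done programmatically. The following is a minimal sketch, assuming the
contents of word/document.xml have already been read into a string (for example via
java.util.zip); the class and method names are made up for illustration, and it uses only the
JDK's DOM parser rather than anything from Tika or POI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class PageBreakCounter {

    // Namespace bound to the "w" prefix in WordprocessingML documents.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts the w:lastRenderedPageBreak elements in the given document.xml content.
    static int countRenderedPageBreaks(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document.xml content", e);
        }
    }

    public static void main(String[] args) {
        // Tiny hand-written fragment, not taken from the attached test files.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Page 1</w:t></w:r></w:p>"
                + "<w:p><w:r><w:lastRenderedPageBreak/><w:t>Page 2</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(countRenderedPageBreaks(xml));
    }
}
```

The count only reflects the pagination recorded when the file was last saved by a paginating
application, so it is a hint rather than a guarantee.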

> Improve docx multiple section handling - headers and footers
> ------------------------------------------------------------
>
>                 Key: TIKA-1178
>                 URL: https://issues.apache.org/jira/browse/TIKA-1178
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Cole
>            Priority: Minor
>              Labels: docx, parsing, sectPr
>         Attachments: 3pages_1section_FirstEvenOddHeaderFooter_mod.docx, 3pages_3sections_defaultHeaderFooter_mod.docx
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently docx to plain text is only accurate for single page files. First off, the sectPr
> tag right above the closing body tag is not the overall document property; it is the section
> property of the last section (if there is only one section, then yes, it is the overall
> document property per se). Right now, if I had a large docx file (say, a book in which I
> made each chapter its own section), I would get the last chapter's header as the header at
> the beginning of the document.
> Addressing sectPr tags inside paragraphs:
> Why are we wrapping the paragraph with the header and footer?
> We should be buffering up pages as we read the docx file, until we hit a section property,
> where we decide how to wrap what we just consumed. I realize that it is difficult to determine
> page breaks when they are caused by overflow (not explicit page breaks).
> The time for completion really depends on how much improvement we want to add in this area.
> Just for reference, my assumptions on Office Open XML structure interpretation come from
> the documentation on this site: http://www.ecma-international.org/publications/standards/Ecma-376.htm
> UPDATE:
> Sample code, test files, and output.
>     // 'test' is the .docx file under test.
>     InputStream in = new FileInputStream(test);
>     ContentHandler handler = new BodyContentHandler();
>     Metadata metadata = new Metadata();
>
>     // Run the OOXML extractor directly and print the extracted plain text.
>     OOXMLExtractorFactory.parse(in, handler, metadata, new ParseContext());
>     String text = handler.toString();
>     System.out.println(text);
> The first test file has 3 pages, a section on each page, and a default (odd) header and
> footer for each section. For reading convenience, the text listed below describes itself,
> i.e. "Header 1" means the first page header text, etc.
> Here is the content of the sample file (3pages_3sections_defaultHeaderFooter_mod.docx):
> Header 1
> First paragraph on page 1
> Second paragraph on page 1
> Footer 1
> Header 2
> First paragraph on page 2
> Second paragraph on page 2
> Footer 2
> Header 3
> First paragraph on page 3
> Second paragraph on page 3
> Footer 3
> The output I get is:
> Header 3
> First paragraph on page 1
> Header 1
> Second paragraph on page 1
> Footer 1
> First paragraph on page 2
> Header 2
> Second paragraph on page 2
> Footer 2
> First paragraph on page 3
> Second paragraph on page 3
> Footer 3
> Here is another file with only 1 section with first, odd, and even headers used (3pages_1section_FirstEvenOddHeaderFooter_mod.docx):
> First page header
> First paragraph on page 1
> Second paragraph on page 1
> First page footer
> Second page header (even)
> First paragraph on page 2
> Second paragraph on page 2
> Second page footer (even)
> Third page header (odd)
> First paragraph on page 3
> Second paragraph on page 3
> Third page footer (odd)
> Actual output:
> First page header
> Second page header (even)
> Third page header (odd)
> First paragraph on page 1
> Second paragraph on page 1
> First paragraph on page 2
> Second paragraph on page 2
> First paragraph on page 3
> Second paragraph on page 3
> First page footer
> Second page footer (even)
> Third page footer (odd)



--
This message was sent by Atlassian JIRA
(v6.1#6144)
