tika-dev mailing list archives

From "David Cole (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1178) Improve docx multiple section handling - headers and footers
Date Wed, 09 Oct 2013 11:55:43 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Cole updated TIKA-1178:
-----------------------------

    Description: 
Currently, docx-to-plain-text conversion is only accurate for single-page files. First off, the
sectPr tag right above the closing body tag is not the overall document property; it is the
section property of the last section (if there is only one section, then yes, it is effectively
the overall document property). Right now, if I had a large docx file (say, a book in which I
put each chapter into its own section), I would get the last chapter's header as the header at
the beginning of the document.
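
To make the structure concrete, here is a minimal Python sketch (not Tika's code) of how
sections could be recovered from document.xml. The function name split_sections and the
SAMPLE document are hypothetical; the namespaces and the sectPr/headerReference element
names are taken from the Office Open XML WordprocessingML schema.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS = {"w": W, "r": R}


def split_sections(document_xml):
    """Return one (paragraph_texts, header_rel_ids) pair per section."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    sections, buffered = [], []

    def header_ids(sect_pr):
        # Relationship ids of the header parts referenced by this section.
        return [ref.get(f"{{{R}}}id")
                for ref in sect_pr.findall("w:headerReference", NS)]

    for child in body:
        if child.tag == f"{{{W}}}p":
            buffered.append("".join(t.text or "" for t in child.iter(f"{{{W}}}t")))
            # A sectPr inside a paragraph's pPr closes the section ending with that paragraph.
            sect_pr = child.find("w:pPr/w:sectPr", NS)
            if sect_pr is not None:
                sections.append((buffered, header_ids(sect_pr)))
                buffered = []
        elif child.tag == f"{{{W}}}sectPr":
            # The body-level sectPr describes only the final section.
            sections.append((buffered, header_ids(child)))
            buffered = []
    return sections


# Hypothetical two-section document: each chapter has its own header reference.
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body>
    <w:p><w:r><w:t>Chapter 1</w:t></w:r></w:p>
    <w:p><w:pPr><w:sectPr><w:headerReference w:type="default" r:id="rId1"/></w:sectPr></w:pPr></w:p>
    <w:p><w:r><w:t>Chapter 2</w:t></w:r></w:p>
    <w:sectPr><w:headerReference w:type="default" r:id="rId2"/></w:sectPr>
  </w:body>
</w:document>"""

sections = split_sections(SAMPLE)
```

In this sample the body-level sectPr (rId2) ends up attached to the second section only, which
is the behavior argued for above.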

Addressing sectPr tags inside paragraphs:
Why are we wrapping the paragraph with the header and footer?
We should be buffering up pages as we read the docx file until we hit a section property, at
which point we decide how to wrap what we just consumed. I realize that it is difficult to
determine page breaks when they are caused by overflow (not explicit page breaks).
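
The buffering idea could look roughly like the Python sketch below. It is an illustration
only, under simplifying assumptions: the parser is assumed to emit a flat stream of
hypothetical ("text", ...) and ("sectPr", ...) events, header and footer text are assumed to
be already resolved into dictionaries, and each section is wrapped once rather than once per
page, since overflow page breaks are not detected.

```python
def wrap_sections(events, headers, footers):
    """Buffer text until a section property arrives, then wrap and flush it.

    events  -- iterable of ("text", str) or ("sectPr", section_key) pairs
    headers -- dict mapping section_key to header text (hypothetical lookup)
    footers -- dict mapping section_key to footer text (hypothetical lookup)
    """
    output, buffered = [], []
    for kind, value in events:
        if kind == "text":
            buffered.append(value)
        elif kind == "sectPr":
            # Only now do we know which header and footer belong to the buffered text.
            header = headers.get(value)
            footer = footers.get(value)
            if header:
                output.append(header)
            output.extend(buffered)
            if footer:
                output.append(footer)
            buffered = []
    # Text with no closing sectPr has no section properties; emit it unwrapped.
    output.extend(buffered)
    return output


events = [
    ("text", "Chapter 1"), ("sectPr", "s1"),
    ("text", "Chapter 2"), ("sectPr", "s2"),
]
result = wrap_sections(
    events,
    headers={"s1": "Header A", "s2": "Header B"},
    footers={"s1": "Footer A"},
)
```

The key design point is that nothing is written out until the sectPr is seen, so the header
of one chapter can never be attached to the text of another.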

The time for completion is really dependent on how much improvement we want to add in this
area.

Just for reference, my assumptions about how to interpret the Office Open XML structure come
from the documentation on this site: http://www.ecma-international.org/publications/standards/Ecma-376.htm

UPDATE:

Sample code, test files, and output.

> Improve docx multiple section handling - headers and footers
> ------------------------------------------------------------
>
>                 Key: TIKA-1178
>                 URL: https://issues.apache.org/jira/browse/TIKA-1178
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Cole
>            Priority: Minor
>              Labels: docx, parsing, sectPr
>   Original Estimate: 4h
>  Remaining Estimate: 4h



--
This message was sent by Atlassian JIRA
(v6.1#6144)
