     [ https://issues.apache.org/jira/browse/TIKA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aeham Abushwashi updated TIKA-1768:
-----------------------------------
    Attachment: headers_footers.patch

> Document headers and footers in metadata
> ----------------------------------------
>
>                 Key: TIKA-1768
>                 URL: https://issues.apache.org/jira/browse/TIKA-1768
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Aeham Abushwashi
>         Attachments: headers_footers.patch
>
>
> I have a use case where I need document headers and footers to be explicitly marked as such in Tika's output metadata fields. As far as I can see, there's no easy built-in way to do this.
> The attached patch adds a HeaderFooterContentHandler, which writes headers and footers into their own metadata fields. This works out of the box with Word file formats.
> The patch also includes some tweaks that let the Excel and PowerPoint parsers/extractors explicitly mark headers and footers as such in the output XHTML, so that the aforementioned content handler can spot them. Unit tests have been added, and existing ones modified, to verify that the parsers and the content handler work together correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
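
For readers without the patch to hand, the general idea of a handler that picks headers and footers out of the XHTML stream can be sketched with plain JDK SAX classes. This is not the HeaderFooterContentHandler from the attachment: the class name HeaderFooterSketch, the extract helper, and the assumption that headers and footers arrive as div elements with class="header" or class="footer" are all illustrative, and a plain map stands in for Tika's metadata object.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the class from headers_footers.patch.
public class HeaderFooterSketch extends DefaultHandler {
    // Stand-in for metadata fields: field name -> collected values.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentField; // "header", "footer", or null when outside both
    private int depth;           // nesting depth inside the marked div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (currentField != null) {
            depth++; // nested element inside a header/footer div
            return;
        }
        // Assumption: the parser marks headers/footers as <div class="header|footer">.
        String cls = atts.getValue("class");
        if ("div".equals(qName) && ("header".equals(cls) || "footer".equals(cls))) {
            currentField = cls;
            depth = 0;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentField != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentField == null) {
            return;
        }
        if (depth > 0) {
            depth--; // closing a nested element, still inside the marked div
            return;
        }
        // Closing the marked div itself: store its text under the field name.
        fields.computeIfAbsent(currentField, k -> new ArrayList<>()).add(text.toString().trim());
        currentField = null;
    }

    public Map<String, List<String>> getFields() {
        return fields;
    }

    // Convenience helper: parse an XHTML string and return the collected fields.
    public static Map<String, List<String>> extract(String xhtml) {
        try {
            HeaderFooterSketch handler = new HeaderFooterSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getFields();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><div class=\"header\"><p>Quarterly Report</p></div>"
                + "<p>Body text</p><div class=\"footer\"><p>Page 1</p></div></body>";
        System.out.println(extract(xhtml)); // {header=[Quarterly Report], footer=[Page 1]}
    }
}
```

In a real Tika pipeline a handler like this would presumably wrap or decorate the caller's content handler and write into the Metadata instance rather than a local map; the sketch only shows the element-matching step.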