tika-dev mailing list archives

From "Ulf Dittmer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1768) Document headers and footers in metadata
Date Mon, 15 May 2017 09:55:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010261#comment-16010261 ]

Ulf Dittmer commented on TIKA-1768:
-----------------------------------

As a semi-related issue, I'd like to see an option to have parsers ignore headers and footers.

> Document headers and footers in metadata
> ----------------------------------------
>
>                 Key: TIKA-1768
>                 URL: https://issues.apache.org/jira/browse/TIKA-1768
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Priority: Critical
>         Attachments: HeaderAndFooterTestFiles.zip, headers_footers.patch
>
>
> I have a use case where I need document headers and footers to be explicitly marked as
> such in Tika's output metadata fields. As far as I can see, there's no easy built-in way for
> doing this.
> The attached patch adds a HeaderFooterContentHandler which enables addition of headers
> and footers into their own metadata fields. This works out of the box with Word file formats.
> Also included in the patch are some tweaks to enable Excel and Powerpoint parsers/extractors
> to explicitly mark headers and footers as such in the output XHTML and
> enable the aforementioned content handler to spot them. Unit tests have been added, and
> existing ones modified, to verify that the parsers and the content handler work together correctly.
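
For readers who have not opened the attached patch, the general idea of a content handler that pulls headers and footers out of the XHTML event stream could look roughly like the sketch below. It is not the patch's HeaderFooterContentHandler. It uses only the JDK's SAX classes, and it assumes that headers and footers arrive as elements carrying class="header" or class="footer". The class name HeaderFooterSketch, the parse helper, and the sample XHTML are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects text found inside elements marked
// class="header" or class="footer" (an assumed markup convention).
public class HeaderFooterSketch extends DefaultHandler {
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();
    private String current = null; // "header" or "footer" while inside a marked element
    private int depth = 0;         // element nesting depth inside the marked element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // nested element inside a header/footer
            return;
        }
        String cls = atts.getValue("class");
        if ("header".equals(cls) || "footer".equals(cls)) {
            current = cls;
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && --depth == 0) {
            current = null; // left the marked element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            fields.computeIfAbsent(current, k -> new StringBuilder()).append(ch, start, length);
        }
    }

    // Returns the collected text for "header" or "footer", or "" if none was seen.
    public String get(String name) {
        StringBuilder sb = fields.get(name);
        return sb == null ? "" : sb.toString().trim();
    }

    // Convenience helper for the example: parses an XHTML string with this handler.
    public static HeaderFooterSketch parse(String xhtml) throws Exception {
        HeaderFooterSketch handler = new HeaderFooterSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
            + "<div class=\"header\"><p>Quarterly report</p></div>"
            + "<p>Body text</p>"
            + "<div class=\"footer\"><p>Page 1</p></div>"
            + "</body></html>";
        HeaderFooterSketch h = parse(xhtml);
        System.out.println("header=" + h.get("header"));
        System.out.println("footer=" + h.get("footer"));
    }
}
```

In a real Tika setup the collected strings would be written to metadata fields rather than kept in a map, and the same depth tracking could instead be used to drop header and footer text from the body, which is the option requested in the comment above.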



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
