tika-dev mailing list archives

From "Ulf Dittmer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1768) Document headers and footers in metadata
Date Mon, 15 May 2017 09:55:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010261#comment-16010261 ]

Ulf Dittmer commented on TIKA-1768:

As a semi-related issue, I'd like to see an option to have parsers ignore headers and footers.

> Document headers and footers in metadata
> ----------------------------------------
>                 Key: TIKA-1768
>                 URL: https://issues.apache.org/jira/browse/TIKA-1768
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Priority: Critical
>         Attachments: HeaderAndFooterTestFiles.zip, headers_footers.patch
> I have a use case where I need document headers and footers to be explicitly marked as
> such in Tika's output metadata fields. As far as I can see, there's no easy built-in way for
> doing this.
> The attached patch adds a HeaderFooterContentHandler which enables addition of headers
> and footers into their own metadata fields. This works out of the box with Word file formats.
> Also included in the patch are some tweaks to enable Excel and Powerpoint parsers/extractors
> to explicitly mark headers and footers as such in the output XHTML and
> enable the aforementioned content handler to spot them. Unit tests have been added, and
> existing ones modified, to verify that the parsers and the content handler work together correctly.
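
For readers without the patch at hand, the general idea of a content handler that pulls headers and footers out of the XHTML event stream might look roughly like the sketch below. This is not the HeaderFooterContentHandler from headers_footers.patch. The class name HeaderFooterSketch, the assumption that headers and footers arrive as elements with class="header" or class="footer", and the field names "header" and "footer" are all illustrative. It uses only the JDK's SAX classes, not Tika's API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the handler shipped in the attached patch.
public class HeaderFooterSketch extends DefaultHandler {
    // Assumption: headers/footers are marked as class="header" / class="footer".
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();
    private String current; // "header" or "footer" while inside a marked element, else null
    private int depth;      // element nesting depth inside the marked element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // nested element inside a header or footer
            return;
        }
        String cls = atts.getValue("class");
        if ("header".equals(cls) || "footer".equals(cls)) {
            current = cls;
            depth = 1;
            fields.computeIfAbsent(cls, k -> new StringBuilder());
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Leave capture mode once the marked element itself is closed.
        if (current != null && --depth == 0) {
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            fields.get(current).append(ch, start, length);
        }
    }

    // Returns the collected text for "header" or "footer", or "" if none was seen.
    public String get(String field) {
        StringBuilder sb = fields.get(field);
        return sb == null ? "" : sb.toString().trim();
    }

    // Convenience helper: run the handler over an XHTML string.
    public static HeaderFooterSketch parse(String xhtml) throws Exception {
        HeaderFooterSketch handler = new HeaderFooterSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"header\"><p>Quarterly report</p></div>"
                + "<p>Body text.</p>"
                + "<div class=\"footer\"><p>Page 1</p></div>"
                + "</body></html>";
        HeaderFooterSketch h = parse(xhtml);
        System.out.println("header=" + h.get("header"));
        System.out.println("footer=" + h.get("footer"));
    }
}
```

The same depth tracking could presumably serve the option requested in the comment above: instead of collecting the text inside a marked element, a filtering handler would simply not forward those events to the downstream handler.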

