tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2362) Skipping Header and Footer data from documents
Date Tue, 16 May 2017 14:47:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012500#comment-16012500 ]

Nick Burch commented on TIKA-2362:

On the whole, the headers and footers should be in their own div tags with sensible-sounding
names. As long as you're working at the XHTML level, you should be able to filter those out
with an XPath content handler. (You can then turn that back into plain text later if you want.)
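
As a rough illustration of the div filtering described above, here is a minimal sketch that
uses only the JDK's SAX classes rather than Tika's own XPath handler. It assumes the parser
emits headers and footers as <div class="header"> and <div class="footer">; those class
names are an assumption, so check the XHTML your documents actually produce. The class name
HeaderFooterSkipper and the extractText helper are hypothetical, made up for this example.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** SAX filter that drops every div whose class attribute marks a header or footer. */
class HeaderFooterSkipper extends XMLFilterImpl {
    // Assumed class names; adjust to whatever the XHTML really contains.
    private static final Set<String> SKIPPED = Set.of("header", "footer");

    // Greater than zero while we are inside a skipped div (counts nested elements).
    private int skipDepth = 0;

    HeaderFooterSkipper(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth++;            // nested element inside a skipped div
            return;
        }
        String cls = atts.getValue("class");
        if ("div".equals(qName) && cls != null && SKIPPED.contains(cls)) {
            skipDepth = 1;          // start skipping at this div
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;            // leaving a skipped element
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Parses well-formed XHTML and returns its text with header/footer divs removed. */
    static String extractText(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        HeaderFooterSkipper filter = new HeaderFooterSkipper(reader);
        StringBuilder text = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xhtml)));
        return text.toString();
    }
}
```

Calling HeaderFooterSkipper.extractText on the XHTML output keeps the body text and drops
anything nested inside the matched divs. Inside Tika itself the same depth-counting logic
would probably sit in a content handler wrapped around the one passed to the parser, but
that wiring is not shown here.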

> Skipping Header and Footer data from documents
> ----------------------------------------------
>                 Key: TIKA-2362
>                 URL: https://issues.apache.org/jira/browse/TIKA-2362
>             Project: Tika
>          Issue Type: Wish
>          Components: general, handler
>            Reporter: Mujahid Ateeb Khan
>            Assignee: Tim Allison
>            Priority: Trivial
> Is there any method to skip header and footer data of documents (pdf, docx, doc, odt)?

This message was sent by Atlassian JIRA
