tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1030) Page extraction for Word,Excel Documents
Date Fri, 23 Nov 2012 16:16:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503239#comment-13503239
] 

Nick Burch commented on TIKA-1030:
----------------------------------

Excel isn't a page-based format, so there's no page information to return.

Generally, Word doesn't store page information in the file; it normally computes it on the
fly based on the page / printer / font settings. There may be some paging information in the
file, at the very least forced page breaks. We could look at adding those in, but you won't
get the same thing as in a PDF, as the file format isn't laid out the same way. (PDF is a
page-based format; Word is more similar to something like HTML, in terms of text with styling.)
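
To illustrate what "forced page breaks" could give you, here is a rough sketch. It is not
Tika code and not a proposed patch. It assumes the newer .docx (OOXML) format only, where
an explicit page break appears in word/document.xml as a <w:br w:type="page"/> element, and
it assumes you have already pulled that XML out of the zip as a string. The class and method
names (ForcedPageBreaks, splitOnForcedBreaks) are made up for the example. It uses only the
JDK's XML parser and splits the text wherever a forced break occurs:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only: splits the text of a
// WordprocessingML document body at explicit (forced) page breaks.
public class ForcedPageBreaks {
    // Namespace of the "w:" prefix in .docx main document parts.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // documentXml is assumed to be the content of word/document.xml.
    public static List<String> splitOnForcedBreaks(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)))
                .getDocumentElement();

            List<StringBuilder> segments = new ArrayList<>();
            segments.add(new StringBuilder());
            walk(root, segments);

            List<String> result = new ArrayList<>();
            for (StringBuilder segment : segments) {
                result.add(segment.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    private static void walk(Node node, List<StringBuilder> segments) {
        boolean isParagraph = false;
        if (node instanceof Element && W_NS.equals(node.getNamespaceURI())) {
            Element el = (Element) node;
            String name = el.getLocalName();
            if ("br".equals(name) && "page".equals(el.getAttributeNS(W_NS, "type"))) {
                // Forced page break: start a new segment.
                segments.add(new StringBuilder());
                return;
            }
            if ("t".equals(name)) {
                // Text run: append to the current segment.
                segments.get(segments.size() - 1).append(el.getTextContent());
                return;
            }
            isParagraph = "p".equals(name);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, segments);
        }
        if (isParagraph) {
            // Keep paragraph boundaries visible in the extracted text.
            segments.get(segments.size() - 1).append('\n');
        }
    }
}
```

Calling ForcedPageBreaks.splitOnForcedBreaks(xml) returns one string per run of text between
forced breaks. These segments are not real pages: text that flows onto a new page without an
explicit break stays in the same segment, which is exactly the difference from PDF described
above.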
                
> Page extraction for Word,Excel Documents
> ----------------------------------------
>
>                 Key: TIKA-1030
>                 URL: https://issues.apache.org/jira/browse/TIKA-1030
>             Project: Tika
>          Issue Type: Improvement
>         Environment: For use with Solr
>            Reporter: David vandendriessche
>              Labels: solr_cell, tika
>
> I would like to extract pages from Word docs and Excel sheets.
> Reason: I'm using Solr to search files and give page hit results. For this I used PDFBox
> for page extraction. Now I would like to upload other doctypes but I can't seem to find
> paging support for them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
