tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-630) Dealing with PDF documents from scanning programs
Date Sun, 01 Mar 2015 22:14:04 GMT

     [ https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-630.
    Resolution: Fixed

> Dealing with PDF documents from scanning programs
> -------------------------------------------------
>                 Key: TIKA-630
>                 URL: https://issues.apache.org/jira/browse/TIKA-630
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.10
>            Reporter: Joseph Vychtrle
>            Priority: Minor
>              Labels: ocr, pdf
>
> Hey,
> sorry I didn't post this to the mailing list, I never got the confirmation.
> The issue is that people often don't realize there are two kinds of PDF documents:
> those exported from OpenOffice/MS Office and those produced by scanner software. When
> Tika processes a scanned document, it detects the PDF content type, but the file contains
> only images. I don't know how to deal with that. There should be a function that reports
> which kind of PDF a document is, so that I can pass PDFs from scanner software to OCR
> software.
> If there is a way to do that, could anybody please explain how?
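
One way to approximate the check requested above is to look at how much text a parser actually extracted: a scanned PDF usually yields almost none. The sketch below is only a heuristic under stated assumptions, not a Tika API and not necessarily how TIKA-630 was resolved. The class name, method name, and threshold are hypothetical, and the caller is assumed to have already obtained the extracted text and the page count from whatever parser is in use.

```java
/**
 * Hypothetical helper (not part of Tika): guesses whether a PDF is
 * image-only from the amount of text a parser managed to extract.
 */
final class ScannedPdfHeuristic {

    // Assumed threshold: fewer visible characters per page than this
    // suggests there is no real text layer. Tune it for your documents.
    static final int MIN_CHARS_PER_PAGE = 20;

    private ScannedPdfHeuristic() {
    }

    /**
     * @param extractedText text extracted from the PDF (may be null or empty)
     * @param pageCount     number of pages in the PDF, must be positive
     * @return true if the document looks like a scan that needs OCR
     */
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        // Count only non-whitespace characters, since extraction from
        // image-only pages often produces nothing but line breaks.
        long visible = 0;
        if (extractedText != null) {
            visible = extractedText.chars()
                    .filter(c -> !Character.isWhitespace(c))
                    .count();
        }
        return visible < (long) MIN_CHARS_PER_PAGE * pageCount;
    }
}
```

A document for which `looksScanned` returns true could then be routed to OCR software. The heuristic can misfire on PDFs that legitimately contain very little text (for example, a page of diagrams), so the threshold should be treated as a starting point.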

