tika-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed
Date Mon, 21 Oct 2019 21:36:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956463#comment-16956463 ]

ASF GitHub Bot commented on TIKA-2624:
--------------------------------------

epugh commented on issue #232: Fix for TIKA-2624 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/232#issuecomment-544718260
 
 
   Interesting patch. Do you have any examples of this being an issue that you can share?
   I never really thought about having them be the same.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2624
>                 URL: https://issues.apache.org/jira/browse/TIKA-2624
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Assignee: Tim Allison
>            Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens in
> AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract for OCR.
> These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which only writes it into the image
> metadata (i.e. it doesn't control scaling at all; it's just an advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which uses it as the
> rendering scale. A scale of 1.0 corresponds to 72dpi, so Tika's default of 2.0 requests
> 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then advertising 300dpi
> in the image metadata. This makes no sense to me, and is surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same DPI value
> in both places.
> We should keep the existing default DPI value, since Tesseract is trained at 300dpi by
> default, so this will mean that all stages between PDFRenderer and Tesseract default
> to 300dpi.
> This change will have the side effect that the temporary images between the PDF rendering
> and Tesseract will be roughly 4x larger in pixel count (going from 144dpi to 300dpi is a
> factor of (300/144)^2, about 4.3). This will have a memory and temporary disk space impact,
> but I think that it's still best to have the whole pipeline using 300dpi. People who have
> memory constraints will need to reduce {{ocrDPI}} and make the corresponding changes on
> the Tesseract side.
>  
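
The arithmetic in the description above can be checked with a small Java sketch. The class
OcrDpiMath and its method names are hypothetical and are not part of Tika or PDFBox; the
only facts taken from the issue are that a render scale of 1.0 corresponds to 72dpi and
that the defaults are ocrImageScale 2.0 and ocrDPI 300.

```java
// Hypothetical helper for illustration only; it is not Tika or PDFBox code.
// It restates the arithmetic from the issue description.
final class OcrDpiMath {
    // Per the issue: a PDFBox render scale of 1.0 corresponds to 72 dpi.
    static final double DPI_AT_SCALE_ONE = 72.0;

    // Effective rendering resolution for a given render scale.
    static double scaleToDpi(double scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    // Render scale that would be needed to actually render at the given DPI.
    static double dpiToScale(double dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    // Pixel count grows with the square of the linear resolution,
    // because both width and height scale by toDpi / fromDpi.
    static double pixelCountRatio(double fromDpi, double toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        double renderedDpi = scaleToDpi(2.0); // default ocrImageScale from the issue
        double advertisedDpi = 300.0;         // default ocrDPI from the issue

        System.out.printf("rendered at %.0f dpi, advertised as %.0f dpi%n",
                renderedDpi, advertisedDpi);
        System.out.printf("scale needed to render at %.0f dpi: %.3f%n",
                advertisedDpi, dpiToScale(advertisedDpi));
        System.out.printf("pixel count ratio after the change: %.2f%n",
                pixelCountRatio(renderedDpi, advertisedDpi));
    }
}
```

The ratio is computed on pixel counts, so it only approximates the growth of the temporary
files; the actual size on disk also depends on the image format and compression used.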



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
